Method and apparatus for extracting and evaluating mutually similar portions in one-dimensional sequences in molecules and/or three-dimensional structures of molecules

ABSTRACT

In the analysis of one-dimensional sequences of molecules, the longest common subsequence, the number of elements constituting the subsequence, and the appearance positions of the subsequence are determined by a novel and simple method, and processes such as homology decision, homology search, motif search, and alignment are performed based on the results. In the analysis of three-dimensional structures of molecules, limiting conditions, such as geometrical arrangements of elements, are introduced to determine the correspondence of three-dimensional structures at high speed, which makes possible such processing as superposed display of three-dimensional structures of molecules, retrieval of three-dimensional structures, and evaluation of functions. Moreover, the molecules are divided into secondary structures that are then related to each other based on spatial similarity among the secondary structures. Furthermore, similarity among the molecules is decided based on a relationship of spatial positions of the corresponding secondary structures.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] This invention relates to a method and apparatus for extracting and evaluating mutually coinciding or similar portions between sequences of atoms or atomic groups in molecules and/or between three-dimensional structures of molecules and, particularly, to a method and apparatus for automatically extracting and evaluating mutually coinciding or similar portions between amino acid sequences in protein molecules and/or between three-dimensional structures of protein molecules.

[0003] 2. Description of the Related Art

[0004] A gene is in substance DNA, and is expressed as a base sequence including four bases of A (adenine), T (thymine), C (cytosine), and G (guanine). There are about twenty types of amino acids constituting an organism, and it has been shown that arrangements of three bases correspond to the respective amino acids. Accordingly, it has been found that a chain of amino acids is synthesized according to the base sequence of the DNA in the organism and that a protein is formed by folding the synthesized chain. The arrangement of amino acids is expressed as an amino acid sequence in which the respective amino acids are expressed in letters, similar to the base sequence.

[0005] Methods for determining sequences of bases and amino acids have been established together with the development of molecular biology, and therefore a huge amount of gene information including base sequence data and amino acid sequence data has been stored. Thus, in the field of gene information processing, a core subject has been how to extract biological information concerning the structure and function of proteins out of the huge amount of stored gene information.

[0006] A basic technique in extracting the biological information is to compare sequences. This is because similar sequences are considered to have similar biological functions. Accordingly, two techniques are presently studied: a homology search, which estimates the function of an unknown sequence by searching a data base of sequences whose functions are known for a sequence similar to the unknown sequence; and an alignment, in which the sequences being compared by researchers are rearranged so as to maximize the degree of analogy between them.

[0007] Further, it is considered that a region of the sequence in which a function important for the organism is coded is preserved in the evolution process. For instance, a commonly existing sequence pattern (region) is known to be found when the amino acid sequences in proteins having the same function are compared between different types of organisms. This region is called a motif. Accordingly, if it is possible to extract the motif automatically, the property and function of the protein can be shown by finding which motif is included in the sequence. Further, automatic motif extraction is applicable to a variety of protein engineering fields such as strengthening of the properties of preexisting proteins, addition of functions to preexisting proteins, and synthesis of new proteins. As described above, extracting the motif out of the amino acid sequence can be considered an effective means of extracting the biological information. However, the extracting method is not yet established, and researchers currently decide manually which part is a motif sequence after the homology search and alignment.

[0008] A dynamic programming technique that is used in voice recognition processing has been the only method used for automatically comparing two amino acid sequences.

[0009] However, according to the method of comparing the amino acid sequences using the dynamic programming technique, the amino acid sequences are compared two-dimensionally. Thus, this method requires a large memory capacity and a long processing time.

[0010] Meanwhile, in the fields of physics and chemistry, in order to examine the properties of a new (unknown) substance and to produce the new substance artificially, three-dimensional structures of substances are determined by a technique such as X-ray crystal analysis or NMR analysis, and information on the determined three-dimensional structures is stored in a data base. As a typical data base, the PDB (Protein Data Bank), in which three-dimensional structures of proteins or the like identified by X-ray crystal analysis are registered, is widely known and universally used. Further, the CSD (Cambridge Structural Database) is known as a data base in which chemical substances are registered.

[0011] In a protein, a plurality of amino acids are linked to one another as a single chain, and this chain is folded in an organism to thereby form a three-dimensional structure. In this way, the protein exhibits a variety of functions. The respective amino acids are expressed by numbering them from the N-terminal through the C-terminal. These numbers are called amino acid numbers, amino acid sequence numbers, or amino acid residue numbers. Each amino acid includes a plurality of atoms according to the type thereof. Therefore, the names and administration numbers of proteins, the numbers of the amino acids constituting each protein, the types and three-dimensional coordinates of the atoms constituting the respective amino acids, and the like are registered in the PDB.

[0012] It is known from the results of chemical studies conducted thus far that the three-dimensional structure of a substance is closely related to its function, and the relationship between three-dimensional structure and function is examined through chemical experiments in order to change a substance and to produce a substance having a new function. Particularly, since a structurally similar portion (or a specific portion) between substances having the same function is considered to influence the function of the substance, it is essential to discover a similar structure commonly existing in the three-dimensional structures.

[0013] However, since there is no method of extracting a characteristic portion directly from the three-dimensional coordinates, researchers are at present compelled to display the respective three-dimensional structures in a three-dimensional graphic system and to search for the characteristic portion manually. In general there is also no method of determining a reference orientation of the substance, so this search requires a substantial amount of time.

[0014] When the researcher searches for a similar three-dimensional structure, an r.m.s.d. (root mean square distance) value is used as a scale of the similarity of the three-dimensional structures of the substances. The r.m.s.d. value is the square root of the mean square distance between the corresponding elements constituting the substances. Empirically, the substances are thought to be exceedingly similar to each other in the case where the r.m.s.d. value between the substances is not greater than 1 Å.

[0015] For instance, it is assumed that there are substances expressed by a point set A={a₁, a₂, . . . , a_(i), . . . , a_(m)} and a point set B={b₁, b₂, . . . , b_(j), . . . , b_(n)}, wherein a_(i) (i=1, 2, . . . , m) and b_(j) (j=1, 2, . . . , n) are vectors expressing positions of the respective elements in the three-dimensional space. The elements constituting these substances A and B are related to each other, and the substance B is rotated and moved so that the r.m.s.d. value between the corresponding elements is minimized. For example, if a_(k) is related to b_(k) (k=1, 2, . . . , n), the r.m.s.d. value is obtained by the following equation (1), wherein U denotes a rotation matrix and w_(k) denote respective weights:

$$\text{r.m.s.d.} = \frac{\left( \sum\limits_{k=1}^{n} w_{k} \left( U b_{k} - a_{k} \right)^{2} \right)^{\frac{1}{2}}}{n} \qquad (1)$$
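The r.m.s.d. evaluation for an already fixed correspondence can be sketched as follows. This is a minimal illustration, not the apparatus of the invention: the function name `rmsd` and its arguments are hypothetical, the point set B is assumed to have been rotated and moved beforehand (that is, U has already been applied), and the sketch follows the verbal definition in paragraph [0014] (square root of the weighted mean square distance), whereas equation (1) as printed places n outside the root.

```python
import math

def rmsd(a_points, b_points, weights=None):
    """Root mean square distance between corresponding points a_k and b_k.

    Assumption: b_points are already superposed onto a_points; this sketch
    does not search for the optimal rotation or translation.
    """
    if len(a_points) != len(b_points) or not a_points:
        raise ValueError("point sets must be non-empty and of equal size")
    n = len(a_points)
    if weights is None:
        weights = [1.0] * n  # unweighted case
    total = 0.0
    for a, b, w in zip(a_points, b_points, weights):
        # weighted squared Euclidean distance for one corresponding pair
        total += w * sum((bi - ai) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(total / n)

# Two points, each displaced by 1 along z
value = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
```

Keeping the superposition step separate from the distance evaluation mirrors the text, in which the choice of corresponding elements and the minimizing rotation are distinct problems.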

[0016] A technique of obtaining the rotation and movement of the substances that minimizes the r.m.s.d. value between these corresponding points was proposed by Kabsch et al. (for example, refer to “A Solution for the Best Rotation to Relate Two Sets of Vectors,” by W. Kabsch, Acta Cryst. (1976), A32, 923), and is presently widely used. However, since this method compares the same number of points, researchers are presently studying, by trial and error, which combinations of elements should be related between the substances so as to obtain the minimum r.m.s.d. value.

[0017] Further, it is necessary to study preexisting substances in order to produce a new substance. For instance, in the case where the heat resistance of a certain substance is to be strengthened, a structure commonly existing among strongly heat-resistant substances is determined, and such a structure is added to a newly produced substance to thereby strengthen the function of the substance. To this end, a function of retrieving the necessary structure from the data base is required. However, researchers are presently searching the data base for the necessary structure by trial and error, using the computer graphic system, for the aforementioned reasons.

[0018] As described above, the operators are compelled to graphically display the three-dimensional structure of the substance they want to analyze using the graphic system, and to analyze it by visual comparison with other molecules on a screen, superposition, and like operations.

[0019] Meanwhile, basic structures such as an α helix and a β strand are commonly found in the three-dimensional structure of a protein, and they are called secondary structures. Methods of carrying out an automatic search by similarity of the secondary structures without using the r.m.s.d. value have been considered. According to these methods, a partial structure is expressed by symbols of the secondary structures along the amino acid sequence, and the comparison is made using these symbols. Therefore, the comparison cannot be made according to a similarity of the spatial positional relationship of the partial structures.

[0020] As mentioned above, in the case where the three-dimensional structure of a substance is analyzed using the CSD and PDB, a great amount of time and labor is required to manually search a huge amount of data for a structure and to compare the retrieved structure with the three-dimensional structure to be analyzed, thereby imposing a heavy burden on the operators. As a result, the data included in the data base cannot be utilized effectively, thus presenting the problem that the structure of the substance cannot be analyzed sufficiently. Accordingly, there has been a need for a retrieval system that retrieves structures from a three-dimensional structure data base based on the analogy of the three-dimensional structures.

SUMMARY OF THE INVENTION

[0021] An object of the invention is to provide a method and apparatus capable of automatically extracting and evaluating mutually coinciding or similar portions between sequences of atoms or atomic groups in molecules such as protein molecules in accordance with a simple processing mechanism.

[0022] Another object of the invention is to provide a method and apparatus capable of automatically extracting and evaluating mutually coinciding or similar portions between three-dimensional structures of molecules such as protein molecules.

[0023] In accordance with the present invention there is provided a method of analyzing sequences of atomic groups including a first sequence having m atomic groups and a second sequence having n atomic groups where m and n are integers, comprising the steps of:

[0024] a) preparing an array S[i] having array elements S[0] to S[m];

[0025] b) initializing all array elements of the array S[i] to zero and initializing an integer j to 1;

[0026] c) adding 1 to each array element S[i] that is equal to an array element S[r] and for which i≧r, if the array element S[r] is equal to an array element S[r−1], where r is an occurrence position of the j-th atomic group of the second sequence in the first sequence;

[0027] d) adding 1 to the integer j;

[0028] e) repeating the steps c) and d) until the integer j exceeds n; and

[0029] f) obtaining a longest common atomic group number between the first and the second sequences from a value of the array element S[m].
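Steps a) to f) can be sketched in Python as follows. This is an illustrative reading of the steps rather than a definitive implementation: the function name `lcs_length` and the dictionary used to look up occurrence positions are assumptions, and the sketch processes multiple occurrence positions r in decreasing order, as the detailed description later specifies.

```python
def lcs_length(seq1, seq2):
    """Longest common subsequence length using a single array S[0..m]."""
    m = len(seq1)
    # occurrence positions (1-based) of every element of the first sequence
    occurrences = {}
    for pos, ch in enumerate(seq1, start=1):
        occurrences.setdefault(ch, []).append(pos)

    S = [0] * (m + 1)                      # steps a) and b)
    for ch in seq2:                        # steps d) and e): j = 1 .. n
        # visit occurrence positions r in decreasing order
        for r in reversed(occurrences.get(ch, [])):
            if S[r] == S[r - 1]:           # condition of step c)
                old = S[r]
                i = r
                # add 1 to every S[i], i >= r, that still equals the old S[r];
                # these entries are contiguous because S never decreases
                while i <= m and S[i] == old:
                    S[i] += 1
                    i += 1
    return S[m]                            # step f)

length = lcs_length("ABCBDAB", "BDCABA")   # 4
```

Only one array of m+1 integers is kept, in contrast to the two-dimensional comparison of the dynamic programming technique discussed in the background.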

[0030] It is preferable that the method further comprises the steps of:

[0031] g) preparing an array data[k] having array elements data[0], data[1], . . . ;

[0032] h) storing paired data (r, j) in an array element data[k] if the array element S[i] is changed in the step c), where k=S[r];

[0033] i) linking the paired data (r, j) stored in the step h) to paired data (r′, j′) if r′<r and j′<j, where the paired data (r′, j′) is one stored in an array element data[k−1]; and

[0034] j) obtaining a longest common subsequence between the first and the second sequences and occurrence positions of the longest common subsequence in the first and the second sequences by tracing the link formed in the step i).
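Steps g) to j) can be sketched as an extension of the length computation. The names below (`lcs_with_positions`, the tuple layout `(r, j, link)`) are hypothetical, and the sketch keeps only one link per stored pair and traces a single chain, which is a simplification: it recovers one longest common subsequence and its positions, not every possible one.

```python
def lcs_with_positions(seq1, seq2):
    """Return one LCS and its 1-based (r, j) positions in seq1 and seq2."""
    m = len(seq1)
    occurrences = {}
    for pos, ch in enumerate(seq1, start=1):
        occurrences.setdefault(ch, []).append(pos)

    S = [0] * (m + 1)
    data = {}                               # step g): data[k] -> stored pairs
    for j, ch in enumerate(seq2, start=1):
        for r in reversed(occurrences.get(ch, [])):
            if S[r] == S[r - 1]:
                old = S[r]
                i = r
                while i <= m and S[i] == old:
                    S[i] += 1
                    i += 1
                k = S[r]                    # step h): value after the change
                link = None
                if k > 1:
                    # step i): link to a pair in data[k-1] with r' < r, j' < j
                    link = next(p for p in data[k - 1]
                                if p[0] < r and p[1] < j)
                data.setdefault(k, []).append((r, j, link))

    if S[m] == 0:
        return "", []
    # step j): trace the links back from a pair stored at the maximum level
    node = data[S[m]][0]
    pairs = []
    while node is not None:
        pairs.append((node[0], node[1]))
        node = node[2]
    pairs.reverse()
    return "".join(seq1[r - 1] for r, _ in pairs), pairs

subsequence, positions = lcs_with_positions("ABCBDAB", "BDCABA")
```

Each entry of `positions` gives the occurrence position r in the first sequence and j in the second sequence for one element of the recovered subsequence.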

[0035] In accordance with the present invention there is also provided a method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of:

[0036] a) generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to the second point set from among all candidates for the combination of correspondence; and

[0037] b) calculating a root mean square distance between the elements corresponding in the combination of correspondence generated in the step a).
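One way to picture step a) is a backtracking search that discards a candidate as soon as it violates the restriction condition. The sketch below is only an assumption-laden illustration: it treats both point sets as ordered, and it uses preservation of pairwise distances within a tolerance `tol` as the restriction condition. The function name `correspondences` and the parameter `tol` are hypothetical and are not taken from the specification.

```python
import math

def correspondences(A, B, tol):
    """Yield tuples c where B[c[i]] is matched to A[i], keeping order.

    Assumed restriction: for every already matched pair, the distance
    inside A and the distance inside B differ by at most tol.
    """
    def extend(partial):
        i = len(partial)
        if i == len(A):
            yield tuple(partial)
            return
        start = partial[-1] + 1 if partial else 0   # keep B indices increasing
        for j in range(start, len(B)):
            consistent = all(
                abs(math.dist(A[i], A[h]) - math.dist(B[j], B[partial[h]])) <= tol
                for h in range(i)
            )
            if consistent:                          # prune violating branches
                partial.append(j)
                yield from extend(partial)
                partial.pop()
    yield from extend([])

A = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
B = [(5, 5, 5), (10, 0, 0), (6, 5, 5), (6, 6, 5)]
found = list(correspondences(A, B, tol=1e-6))       # [(0, 2, 3)]
```

Only the correspondences that survive the pruning would then be passed on to the root mean square distance calculation of step b).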

[0038] In accordance with the present invention there is also provided a method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of:

[0039] a) dividing the second point set into a plurality of subsets having a size that is determined by the size of the first point set;

[0040] b) generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to each of the subsets of the second point set from among all candidates for the combination of correspondence; and

[0041] c) calculating a root mean square distance between the elements corresponding in the combination of correspondence generated in the step b).

[0042] In accordance with the present invention there is also provided a method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of:

[0043] a) dividing the first point set and the second point set into first subsets and second subsets, respectively, according to a secondary structure exhibited by the three-dimensional coordinates of the elements of the first and the second point sets;

[0044] b) generating a combination of correspondence satisfying a first restriction condition between the first subsets and the second subsets from among candidates for the combination of correspondence;

[0045] c) determining an optimum correspondence between the elements belonging to each pair of subsets corresponding in the combination of correspondence generated in the step b); and

[0046] d) calculating a root mean square distance between all of the elements corresponding in the optimum correspondence in the step c).

[0047] In accordance with the present invention there is also provided an apparatus for analyzing sequences of atomic groups including a first sequence having m atomic groups and a second sequence having n atomic groups where m and n are integers, comprising:

[0048] means for preparing an array S[i] having array elements S[0] to S[m];

[0049] means for initializing all array elements of the array S[i] to zero and initializing an integer j to 1;

[0050] means for renewing the array S[i] by adding 1 to each array element S[i] that is equal to an array element S[r] and for which i≧r, if the array element S[r] is equal to an array element S[r−1], where r is an occurrence position of the j-th atomic group of the second sequence in the first sequence;

[0051] means for incrementing the integer j by 1;

[0052] means for repeatedly activating the renewing means and the incrementing means until the integer j exceeds n; and

[0053] means for obtaining a longest common atomic group number between the first and the second sequences from a value of the array element S[m].

[0054] It is preferable that the apparatus further comprises:

[0055] means for preparing an array data[k] having array elements data[0], data[1], . . . ;

[0056] means for storing paired data (r, j) in an array element data[k] if the array element S[i] is changed by the renewing means, where k=S[r];

[0057] means for linking the paired data (r, j) stored by the storing means to paired data (r′, j′) if r′<r and j′<j, where the paired data (r′, j′) is one stored in an array element data[k−1]; and

[0058] means for obtaining a longest common subsequence between the first and the second sequences and occurrence positions of the longest common subsequence in the first and the second sequences by tracing the link formed by the linking means.

[0059] In accordance with the present invention there is provided an apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising:

[0060] means for generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to the second point set from among all candidates for the combination of correspondence; and

[0061] means for calculating a root mean square distance between the elements corresponding in the combination of correspondence generated by the generating means.

[0062] In accordance with the present invention there is provided an apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising:

[0063] means for dividing the second point set into a plurality of subsets having a size that is determined by the size of the first point set;

[0064] means for generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to each of the subsets of the second point set from among all candidates for the combination of correspondence; and

[0065] means for calculating a root mean square distance between the elements corresponding in the combination of correspondence generated by the generating means.

[0066] In accordance with the present invention there is also provided an apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising:

[0067] means for dividing the first point set and the second point set into first subsets and second subsets, respectively, according to a secondary structure exhibited by the three-dimensional coordinates of the elements of the first and the second point sets;

[0068] means for generating a combination of correspondence satisfying a first restriction condition between the first subsets and the second subsets from among candidates for the combination of correspondence;

[0069] means for determining an optimum correspondence between the elements belonging to each pair of subsets corresponding in the combination of correspondence generated by the generating means; and

[0070] means for calculating a root mean square distance between all of the elements corresponding in the optimum correspondence.

BRIEF DESCRIPTION OF THE DRAWINGS

[0071] FIG. 1 is a block diagram showing a construction of a gene information survey apparatus according to an embodiment of the present invention;

[0072] FIG. 2 is a flowchart showing a process for detecting a longest common character number in an LCS detection unit of FIG. 1;

[0073] FIGS. 3 and 4 are flowcharts showing a process for detecting an LCS and occurrence positions thereof in the LCS detection unit;

[0074] FIG. 5 is a diagram of an example of the table of occurrence positions generated in the LCS detection unit;

[0075] FIG. 6 is a diagram explaining an example of the operation of the LCS detection unit;

[0076] FIG. 7 is a diagram showing a linked data structure generated in the LCS detection unit;

[0077] FIG. 8 is a flowchart showing the linked data structure tracing operation;

[0078] FIG. 9 is a flowchart showing an operation of a retrieval process called in the tracing operation;

[0079] FIG. 10 is a diagram showing an example of output results of the gene information survey apparatus;

[0080] FIG. 11 is a diagram showing another example of output results of the apparatus;

[0081] FIG. 12 is a diagram showing another example of output results of the apparatus;

[0082] FIGS. 13A to 13D are diagrams showing the determination of correspondence of partial three-dimensional structures;

[0083] FIGS. 14A and 14B are diagrams showing tree structures expressing candidates for a combination of correspondence between elements of two nonordered point sets;

[0084] FIG. 15 is a flowchart showing an algorithm for generating a combination of correspondence between two nonordered point sets;

[0085] FIGS. 16A and 16B are diagrams showing tree structures expressing candidates for a combination of correspondence between elements of two ordered point sets;

[0086] FIG. 17 is a flowchart showing an algorithm for generating a combination of correspondence between two ordered point sets;

[0087] FIG. 18 is a diagram showing a tree structure expressing candidates for a combination of correspondence between elements of two ordered point sets that are partially related to each other;

[0088] FIGS. 19A and 19B are diagrams explaining the refining of candidates using a distance relationship;

[0089] FIGS. 20A and 20B are diagrams explaining the refining of candidates using an angle relationship;

[0090] FIG. 21 is a diagram showing a tree structure explaining the refining of candidates using a restriction condition of the number of nil elements;

[0091] FIG. 22 is a block diagram showing a construction of a molecular structure display device according to another embodiment of the present invention;

[0092] FIGS. 23A and 23B are diagrams showing amino acid sequences of calmodulin and troponin C, respectively;

[0093] FIGS. 24A and 24B are diagrams showing three-dimensional structures of calmodulin and troponin C, respectively;

[0094] FIG. 25 is a diagram showing an example of output results of the device of FIG. 22;

[0095] FIG. 26 is a diagram showing another example of output results of the device of FIG. 22;

[0096] FIG. 27 is a block diagram of a construction of a three-dimensional structure retrieval device according to another embodiment of the present invention;

[0097] FIG. 28 is a diagram showing a construction of a function data base generating device according to another embodiment of the present invention;

[0098] FIG. 29 is a diagram showing an example of output results of the device of FIG. 27;

[0099] FIG. 30 is a diagram showing the retrieval results as three-dimensional structures;

[0100] FIG. 31 is a block diagram showing a construction of a function predicting device according to another embodiment of the present invention;

[0101] FIGS. 32A and 32B are diagrams showing linear structures and non-linear structures, respectively;

[0102] FIG. 33 is a diagram explaining the division of a point set B into subsets according to the number of elements belonging to a point set A;

[0103] FIG. 34 is a flowchart showing a process for dividing a point set B into subsets according to the number of elements belonging to a point set A;

[0104] FIGS. 35A and 35B are diagrams explaining the division of a point set B into subsets according to a spatial size of a point set A;

[0105] FIG. 36 is a flowchart showing an example of a process for dividing a point set B into subsets according to a spatial size of a point set A;

[0106] FIG. 37 is a flowchart showing another example of the process for dividing a point set B into subsets according to a spatial size of a point set A;

[0107] FIGS. 38A and 38B are diagrams showing amino acid sequences of trypsin and elastase, respectively;

[0108] FIGS. 39A and 39B are diagrams showing retrieval results of three-dimensional structures;

[0109] FIG. 40 is a diagram showing a tree structure expressing candidates for a combination of correspondence between subsets;

[0110] FIG. 41 is a flowchart showing a process of determining correspondence between subsets;

[0111] FIG. 42 is a block diagram showing a construction of a retrieval process device according to another embodiment of the present invention;

[0112] FIG. 43 is a flowchart showing a process of dividing a point set into subsets according to secondary structures;

[0113] FIG. 44 is a diagram showing the results of the division of a point set into subsets according to secondary structures;

[0114] FIG. 45 is a flowchart showing a process for retrieving proteins using a method of dividing into subsets according to secondary structures;

[0115] FIG. 46 is a diagram showing an output result of a similar structure retrieval using a protein as a retrieval key; and

[0116] FIGS. 47A and 47B are diagrams showing a protein having a similar structure retrieved by a key protein.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Analysis of one-dimensional sequences of molecules

[0117] FIG. 1 shows a gene information survey apparatus 1 according to an embodiment of the invention. In FIG. 1, the reference numeral 40 denotes an input device connected to the gene information survey apparatus 1; the reference numeral 41 denotes an interactive device such as a keyboard and a mouse provided in the input device 40; the reference numeral 42 denotes a display device connected to the gene information survey apparatus 1; the reference numeral 50 denotes an amino acid sequence data base for storing amino acid sequence information expressed by character sequences; and the reference numeral 60 denotes a motif data base for storing motif sequence information expressed by character sequences.

[0118] The gene information survey apparatus 1 of this embodiment includes an LCS detection unit 30, a homology decision unit 31, a homology search unit 32, a motif search unit 33, an alignment unit 34, and a display control unit 35.

[0119] The LCS detection unit 30 determines an LCS (Longest Common Subsequence), the length of the LCS, and the occurrence positions of the LCS between a character sequence expressing an amino acid sequence input from the input device 40 and a character sequence expressing an amino acid sequence taken from the amino acid sequence data base 50 or the motif data base 60. The LCS is the longest subsequence among those which commonly occur, continuously or intermittently, in both character sequences, and the longest common character number is the number of characters constituting the LCS.

[0120] The homology decision unit 31 determines the analogy between the two amino acid sequences surveyed by the LCS detection unit 30 based on the detection result of the LCS detection unit 30. The homology search unit 32 searches the amino acid sequence data base 50 for an amino acid sequence similar to the amino acid sequence input from the input device 40 based on the decision result of the homology decision unit 31. The motif search unit 33 searches the motif data base 60 for a motif sequence similar to the amino acid sequence input from the input device 40 based on the detection result of the LCS detection unit 30. The alignment unit 34 aligns the character sequence of the amino acid sequence input from the input device 40 with the character sequence of the amino acid sequence given from the amino acid sequence data base 50 or the motif data base 60 based on the detection result of the LCS detection unit 30. The display control unit 35 displays the processing results of the respective processing units on the display device 42.

[0121] The processing carried out by the LCS detection unit 30 in accordance with the processing flows shown in FIGS. 2 to 4 will be described in detail. The processing flow shown in FIG. 2 is carried out to detect the length of the LCS between the two amino acid sequences to be surveyed. The processing flow shown in FIGS. 3 and 4 is carried out to detect the longest common subsequence LCS between the two amino acid sequences to be surveyed and the occurrence positions thereof.

[0122] In detecting the length of the LCS between the amino acid sequences expressed by a character sequence I and a character sequence II, the LCS detection unit 30 reads the characters individually from the character sequence I and generates an occurrence table indicative of the occurrence positions of the respective characters in the character sequence I in Step 1, as shown in the processing flow of FIG. 2.

[0123] This occurrence table is generated, for example, by linking array elements P[1] to P[26], corresponding to the letters A to Z, with data of the occurrence positions of the respective characters by pointers 62, as shown in FIG. 5. For instance, in the case where the amino acid sequence of the character sequence I is expressed as “ABCBDAB,” the occurrence table is generated such that “A” occurs in the sixth and first places; “B” occurs in the seventh, fourth, and second places; “C” occurs in the third place; and “D” occurs in the fifth place. In Step 1, an array S[i] having the same size as the character sequence I, which is used in the subsequent processing, is also initialized, and a zero value is set in each entry.
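The occurrence table of Step 1 can be sketched as follows. The sketch is illustrative only: it uses a Python dictionary of lists in place of the array P[1] to P[26] and the pointers 62, and the function name `occurrence_table` is hypothetical. Each new position is placed at the front of its list so that the positions come out in decreasing order, as in the example in the text.

```python
def occurrence_table(seq):
    """Map each character to its 1-based positions in seq, largest first."""
    table = {}
    for pos, ch in enumerate(seq, start=1):
        # prepend, so later (larger) positions precede earlier ones
        table.setdefault(ch, []).insert(0, pos)
    return table

table = occurrence_table("ABCBDAB")
# {'A': [6, 1], 'B': [7, 4, 2], 'C': [3], 'D': [5]}
```

Storing the positions largest first means that the later steps can simply walk each list from its head when a character occurs several times.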

[0124] In Step 2, the characters are successively read from thecharacter sequence II and the occurrence positions r of these charactersin the character sequence I is specified with reference to theoccurrence table generated in Step 1. Subsequently, in Step 3, it isdetermined whether an entry data of S[r], which is in the r−th place ofthe array S[i], is equal to an entry data of S[r−1], which is in the(r−1) th place thereof.

[0125] If it is determined that S[r]=S[r−1] in Step 3, Step 4 follows, in which “1” is added to every S[i] where i≧r and whose entry data is equal to that of S[r−1]. Subsequently, in Step 5, it is determined whether the processing has been completed up to the last character of the character sequence II. If the determination result is in the negative in Step 5, this routine returns to Step 2. On the other hand, if it is determined that S[r]≠S[r−1] in Step 3, this routine proceeds to Step 5 immediately without executing the addition processing of Step 4.

[0126] In the case where a character of the character sequence II read in Step 2 occurs in the character sequence I a plurality of times, the processing of Steps 3 and 4 is repeated in decreasing order of the occurrence positions r.

[0127] If it is determined that the processing has been completed up to the last character of the character sequence II, this routine proceeds to Step 6, in which the entry data Kmax of the last element S[m] of the array S[i] is output as the length of the LCS.

[0128] In executing the above processing flow, for example, in the case where the amino acid sequence of the character sequence I is expressed as “ABCBDAB” and that of the character sequence II is expressed as “BDCABA,” “r=7, 4, 2” is specified from the list following the array element P[2] of the occurrence table shown in FIG. 5 in accordance with the reading of the first character B (j=1) of the character sequence II, and the entry data of the array S[i] is renewed sequentially as shown in FIG. 6. “r=5” is specified from the occurrence table shown in FIG. 5 in accordance with the reading of the second character D (j=2) of the character sequence II, and the entry data of the array S[i] is renewed as shown in FIG. 6. “r=3” is specified in accordance with the reading of the third character C (j=3) of the character sequence II, and the entry data of the array S[i] is renewed as shown in FIG. 6. “r=6, 1” is specified in accordance with the reading of the fourth character A (j=4) of the character sequence II, and the entry data of the array S[i] is renewed as shown in FIG. 6. It should be noted that the respective entry values of S[i] set in this manner give the length of the LCS between the character subsequence consisting of the first to i-th characters of the character sequence I and the character subsequence consisting of the first to j-th characters of the character sequence II after the j-th character of the character sequence II is processed.

[0129] Thereafter, “r=7, 4, 2” is specified from the occurrence table shown in FIG. 5 in accordance with the reading of the fifth character B of the character sequence II, and the entry data of the array S[i] is renewed as shown in FIG. 6. “r=6, 1” is specified from the occurrence table shown in FIG. 5 in accordance with the reading of the sixth character A of the character sequence II, and the entry data of the array S[i] is renewed as shown in FIG. 6. Lastly, the length of the LCS, “4,” is obtained in S[7]. It should be noted that the array S[i] shown in FIG. 6 additionally includes S[0] for the sake of convenience, and therefore has a size that is larger than the length of the character sequence I (=7) by one.
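The length-detection flow of Steps 1 to 6 can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `lcs_length` and the use of a Python dictionary in place of the pointer-linked array P[1] to P[26] are assumptions made for the sketch.

```python
from collections import defaultdict


def lcs_length(seq1: str, seq2: str) -> int:
    """Length of the LCS of character sequence I (seq1) and II (seq2)."""
    m = len(seq1)
    # Step 1: occurrence table, character -> 1-based positions in seq1,
    # stored largest first so that they are visited in decreasing order.
    occurrence = defaultdict(list)
    for pos in range(m, 0, -1):
        occurrence[seq1[pos - 1]].append(pos)
    # S[0] is the extra entry added for convenience; all entries start at zero.
    S = [0] * (m + 1)
    # Step 2: read the characters of seq2 one by one.
    for ch in seq2:
        for r in occurrence.get(ch, ()):
            # Step 3: compare S[r] with S[r-1].
            if S[r] == S[r - 1]:
                # Step 4: add 1 to every S[i], i >= r, equal to S[r-1].
                # S is non-decreasing, so these entries form one run from r.
                base = S[r - 1]
                i = r
                while i <= m and S[i] == base:
                    S[i] += 1
                    i += 1
    # Step 6: the last entry holds Kmax, the length of the LCS.
    return S[m]


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

For the example sequences of the text, the sketch ends with S[7]=4, which agrees with the value read from FIG. 6.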

[0130] The processing to determine the longest common subsequence between the two amino acid sequences to be surveyed and the occurrence position thereof will be described with reference to FIGS. 3 and 4.

[0131] In detecting the longest common subsequence between the amino acid sequences expressed by the character sequences I and II and the occurrence position thereof, the LCS detection unit 30 successively reads the characters from the character sequence I and, in Step 10 of the processing flow of FIG. 3, generates an occurrence table indicative of the occurrence positions of the respective characters in the character sequence I. In other words, the occurrence table described with reference to FIG. 5 is generated. In Step 10, an array S[i] having the same size as the character sequence I, which is used in the subsequent processing, is initialized and a zero value is set in each entry. Further, an array data[k] having the size corresponding to the length of the LCS is initialized and the respective entries are set so as not to point to anything.

[0132] In Step 11, one character (the j-th character) is read from the character sequence II, and the occurrence position r of this character in the character sequence I is specified with reference to the occurrence table generated in Step 10. Subsequently, in Step 12, it is determined whether the entry data of S[r], which is in the r-th place of the array S[i], is equal to the entry data of S[r−1], which is in the (r−1)-th place of the array S[i]. If it is determined that S[r]=S[r−1] in Step 12, Step 13 follows, in which “1” is added to every S[i] where i≧r and whose entry data is equal to that of S[r−1]. On the other hand, if it is determined that S[r]≠S[r−1] in Step 12, this routine proceeds to Step 17 of the processing flow of FIG. 4 without executing the addition processing of Step 13. In the case where the character of the character sequence II read in Step 11 occurs in the character sequence I a plurality of times, the processing of Steps 12 and 13 is repeated in decreasing order of the occurrence positions r.

[0133] In this way, in detecting the longest common subsequence, the LCS detection unit 30 also executes the processing for detecting the length of the LCS described in the processing flow of FIG. 2.

[0134] After execution of the processing of Step 13, paired data (r, j) including the occurrence position r in the character sequence I and the occurrence position j in the character sequence II is stored in the array data[k] in Step 14 in accordance with the length k of the LCS, which is obtained as the entry data of S[r]. More precisely, the paired data (r, j) is stored at the end of the list linked to the array data[k]. If the array S[i] is unchanged from the one in the preceding processing cycle, the above storing processing is not executed.

[0135] Subsequently, this routine proceeds to the processing flow of FIG. 4 and, in Step 15, it is determined whether the relationships r′<r and j′<j are satisfied with respect to each of the character positions (r′, j′) stored in data[k−1]. Since the character positions cannot be reversed in a subsequence, the above relationships must be satisfied along a subsequence. Therefore, the data (r, j) is linked to the data (r′, j′) in Step 16 only when the above relationships are satisfied. In subsequent Step 17, it is determined whether the processing has been completed up to the last character of the character sequence II. If the determination result is in the negative in Step 17, this routine returns to Step 11 of the processing flow shown in FIG. 3. On the other hand, if it is determined that the above relational expressions are not satisfied in Step 15, this routine proceeds to Step 17 without executing the processing of Step 16.

[0136] If it is determined that the processing has been completed up to the last character of the character sequence II in Step 17, this processing flow ends. The longest common subsequence and the occurrence position thereof are determined by tracing back the links set in Step 16, as will be described in detail later.

[0137] An example of the processing shown in FIGS. 3 and 4 will be described with respect to a case where a first amino acid sequence is expressed by the character sequence I “ABCBDAB” and a second amino acid sequence is expressed by the character sequence II “BDCABA,” as in the aforementioned example.

[0138] As shown at the left end of FIG. 6, since r=7, j=1, and k=1 when S[r] is first renewed in Step 13 of FIG. 3, data (7, 1) is stored in data[1] by being linked thereto in Step 14 of FIG. 3, as shown in FIG. 7. Thereafter, data (4, 1) and (2, 1) are stored.

[0139] Since nothing is stored in data[0], which is provided for the sake of convenience, the processing of Step 16 is not applied to these data. Since S[r] is renewed when r=5, j=2, and k=2, data (5, 2) is stored in data[2] as shown in FIG. 7. In Step 15, the relationships r′<r and j′<j are satisfied for the data (4, 1) and (2, 1) among the data (7, 1), (4, 1), and (2, 1) stored in data[1]. Accordingly, the data (5, 2) is linked to the data (4, 1) and (2, 1) through pointers 70, 72 shown in FIG. 7 in Step 16. By repeating the aforementioned processing, the linked list shown in FIG. 7 is generated. As shown at the right side of FIG. 6, the data (1, 6) is not stored in data[k] since S[r] is unchanged when r=1 and j=6.

[0140] The longest common subsequence and the occurrence position thereof are determined by tracing back the pointers of the character position information stored in data[k]. More specifically, using the example of FIG. 7, the link “(7, 5) of data[4]→(6, 4) of data[3]→(5, 2) of data[2]→(4, 1) of data[1]” is traced and arranged in reverse order, thereby determining the longest common subsequence BDAB and the occurrence positions in the character sequences I and II. Also, the longest common subsequence BDAB and the occurrence positions thereof in the character sequences I and II are determined from the link “(7, 5) of data[4]→(6, 4) of data[3]→(5, 2) of data[2]→(2, 1) of data[1]”. Further, the longest common subsequence BCAB and the occurrence positions thereof in the character sequences I and II are determined from the link “(7, 5) of data[4]→(6, 4) of data[3]→(3, 3) of data[2]→(2, 1) of data[1]”. Moreover, the longest common subsequence BCBA and the occurrence positions thereof in the character sequences I and II are determined from the link “(6, 6) of data[4]→(4, 5) of data[3]→(3, 3) of data[2]→(2, 1) of data[1]”.

[0141] FIGS. 8 and 9 show a processing flow that is executed when the LCS detection unit 30 specifies the longest common subsequence by tracing these links.

[0142] In Step 20 of FIG. 8, the leading data of a link of the LCS is taken from data[Kmax]. In Step 22, a retrieval processing subroutine is called to trace and output all the data of the link following the leading data. In Step 24, it is decided whether other data still remains in data[Kmax]. This routine ends if the processing is completed, and returns to Step 22 if any data remains. This routine is continued until the links of all the LCS are completed. The retrieval processing subroutine shown in FIG. 9 is a recursive routine. In Step 30, it is determined whether the taken data is an end terminal of the link of the LCS by checking the data taken when this subroutine is called. If the determination result is in the affirmative in Step 30, this subroutine returns to the main routine shown in FIG. 8 after executing an output processing in Step 32. If the determination result is in the negative in Step 30, the pointer linked to this data is taken out in Step 34. In Step 36, by checking the content of this pointer, it is determined whether there exists any other pointer linked to other data. If no other pointer exists, the data linked to the above pointer is taken out in Step 38, and the next link is traced by calling this subroutine recursively in Step 40. If another pointer exists in Step 36, the data linked to the pointer is taken out in Step 42 and the next link is traced by calling the subroutine recursively in Step 44. Upon completion of the processing of Step 44, the next pointer is taken out in Step 46 and this subroutine returns to Step 36, thereby executing processing for the next branch.

[0143] By executing the above processing, for example, the data (7, 5), (6, 4), (5, 2), and (4, 1) are sequentially taken out in the example shown in FIG. 7, and the LCS “BDAB” and the occurrence position thereof are output. Then, (2, 1) is taken out to obtain the data (7, 5), (6, 4), (5, 2), and (2, 1), and the LCS “BDAB” and the occurrence position thereof are output. Further, the data (7, 5), (6, 4), (3, 3), and (2, 1) are obtained and the LCS “BCAB” is output. Moreover, the data (6, 6), (4, 5), (3, 3), and (2, 1) are obtained and the LCS “BCBA” is output. In this way, all the LCS are output.
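The storing, linking, and tracing steps described above (Steps 10 to 17 and Steps 20 to 46) can be sketched together as follows. This is a simplified illustration under stated assumptions: the name `all_lcs`, the dictionaries `data` and `links`, and the nested function `trace` are hypothetical stand-ins for the pointer-linked lists of FIG. 7 and the recursive subroutine of FIG. 9.

```python
from collections import defaultdict


def all_lcs(seq1: str, seq2: str):
    """Return every linked chain as (subsequence, [(r, j), ...]) in 1-based positions."""
    m = len(seq1)
    occurrence = defaultdict(list)
    for pos in range(m, 0, -1):
        occurrence[seq1[pos - 1]].append(pos)
    S = [0] * (m + 1)
    data = defaultdict(list)  # data[k]: pairs (r, j) stored at LCS length k
    links = {}                # (r, j) -> linked pairs (r', j') in data[k-1]
    for j, ch in enumerate(seq2, start=1):
        for r in occurrence.get(ch, ()):
            if S[r] != S[r - 1]:
                continue  # S[i] unchanged: nothing is stored
            base = S[r - 1]
            i = r
            while i <= m and S[i] == base:
                S[i] += 1
                i += 1
            k = S[r]
            data[k].append((r, j))
            # Link only to pairs with r' < r and j' < j.
            links[(r, j)] = [(r2, j2) for (r2, j2) in data.get(k - 1, ())
                             if r2 < r and j2 < j]
    kmax = S[m]
    results = []

    def trace(pair, k, chain):
        chain = [pair] + chain  # prepend, so the chain ends up in forward order
        if k == 1:              # end terminal of the link: output
            results.append(("".join(seq1[r - 1] for r, _ in chain), chain))
            return
        for prev in links[pair]:  # one recursive call per branch
            trace(prev, k - 1, chain)

    for head in data.get(kmax, ()):
        trace(head, kmax, [])
    return results


for subsequence, positions in all_lcs("ABCBDAB", "BDCABA"):
    print(subsequence, positions)
```

For the example sequences, the sketch yields the four chains of the text in the same order: BDAB twice (ending at (4, 1) and at (2, 1)), then BCAB, then BCBA.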

[0144] The processing that the respective processing units 31 to 35 of the gene information survey apparatus 1 shown in FIG. 1 execute upon receipt of the length of the LCS, the longest common subsequence, and their occurrence positions detected by the LCS detection unit 30 will now be described.

[0145] When the LCS detection unit 30 decides the length of the LCS between the character sequence of the amino acid sequence input from the input device 40 (hereinafter referred to as an input amino acid sequence) and the character sequence of the amino acid sequence given from the amino acid sequence data base 50 or the motif data base 60, the homology decision unit 31 determines the ratio of the length of the LCS to the length of the character sequence of the input amino acid sequence. In the case where this ratio is greater than a predetermined reference value, the input amino acid sequence is determined to be homologous with the amino acid sequence given from the amino acid sequence data base 50 or the motif data base 60. In the case where this ratio is smaller than the predetermined reference value, the input amino acid sequence is determined not to be homologous with the amino acid sequence given from the data base 50 or 60.
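The homology decision can be illustrated with the following sketch. The names `lcs_length_dp` and `is_homologous` and the reference value 0.5 are assumptions chosen for illustration; the text does not state a concrete reference value. The LCS length is computed here with the textbook dynamic-programming recurrence as a stand-in, not with the occurrence-table method of FIG. 2.

```python
def lcs_length_dp(a: str, b: str) -> int:
    """Textbook dynamic-programming LCS length (stand-in for the FIG. 2 method)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_homologous(query: str, candidate: str, reference: float = 0.5):
    """Return (decision, ratio); ratio = LCS length / length of the input sequence."""
    if not query:
        raise ValueError("the input sequence must not be empty")
    ratio = lcs_length_dp(query, candidate) / len(query)
    # Homologous only when the ratio exceeds the (assumed) reference value.
    return ratio > reference, ratio


print(is_homologous("ABCBDAB", "BDCABA"))  # ratio is 4/7
```

With the example sequences the ratio is 4/7, so the decision depends on where the reference value is set: it is affirmative at 0.5 and negative at 0.6.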

[0146] Based on the decision result of the homology decision unit 31, the homology search unit 32 searches the amino acid sequence data base 50 for an amino acid sequence that is homologous with the input amino acid sequence. In the case where the two amino acid sequences are homologous, the ratio calculated by the homology decision unit 31 and the longest common subsequence determined by the LCS detection unit 30 are displayed on the display device 42 through the display control unit 35.

[0147] FIG. 10 shows an example of this display. The display example displays a processing result of two amino acid sequences: human cytochrome c and bacteria cytochrome c. The longest common subsequences are displayed in accordance with a display mode indicative of the interval at which they are arranged in the two amino acid sequences. More specifically, by adopting a mode of displaying “GD{×3, 3} G{×0, 1} K{×0, 2} . . . ”, the longest common subsequences are displayed as follows. In the human cytochrome c, “GD” is followed by three characters that do not coincide, followed by “G”, which is immediately followed by “K”. On the other hand, in the bacteria cytochrome c, “GD” is followed by three characters that do not coincide, followed by “G”, which is followed by one character that does not coincide. “K” follows immediately thereafter.
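The display mode described above can be approximated by the following sketch, which turns a chain of matched positions into the interval notation. The function name `gap_notation` is hypothetical, and the exact spacing and punctuation of FIG. 10 are assumed from the quoted sample “GD{×3, 3}”.

```python
def gap_notation(seq1: str, pairs) -> str:
    """Format matched 1-based positions (r, j) as characters plus {×gap_I, gap_II}."""
    out = []
    for idx, (r, j) in enumerate(pairs):
        out.append(seq1[r - 1])
        if idx + 1 < len(pairs):
            next_r, next_j = pairs[idx + 1]
            gap1 = next_r - r - 1  # non-coinciding characters in sequence I
            gap2 = next_j - j - 1  # non-coinciding characters in sequence II
            if gap1 or gap2:       # adjacent matches are written as one run
                out.append(f"{{×{gap1}, {gap2}}}")
    return "".join(out)


# Chain for "BDAB" from the example of FIG. 7.
print(gap_notation("ABCBDAB", [(4, 1), (5, 2), (6, 4), (7, 5)]))  # BD{×0, 1}AB
```

Matches that are adjacent in both sequences are merged into one run, in the same way that “GD” appears as a single unit in the quoted display.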

[0148] The motif search unit 33 first searches the motif data base 60 for a motif sequence that is homologous with the input amino acid sequence based on the decision result of the homology decision unit 31, and then decides whether the homologous motif sequence is a true motif sequence included in the input amino acid sequence in accordance with the longest common subsequences determined by the LCS detection unit 30 and the length of the character sequence between the longest common subsequences. For instance, it is determined whether the input amino acid sequence includes a motif sequence called a leucine zipper, in which “L” is followed by six unspecified characters, which are followed again by “L”, and in which a total of five “L” are included together with the intervening six unspecified characters. In the case where the input amino acid sequence includes the motif sequence, the motif search unit 33 displays the input amino acid sequence and the motif sequence on the display device 42 through the display control unit 35. FIG. 11 shows a display example of a rat egg cell potassium channel including a motif called the leucine zipper.

[0149] Upon receipt of the longest common subsequences and their occurrence positions that the LCS detection unit 30 detects, the alignment unit 34 aligns the input amino acid sequence and the amino acid sequence given from the amino acid sequence data base 50 and the motif data base 60 so as to relate the longest common subsequence in one amino acid sequence to that in the other, and displays the aligned amino acid sequences on the display device 42 through the display control unit 35. FIG. 12 shows an example of this display, which displays a processing result of two amino acid sequences: human cytochrome c and bacteria cytochrome c. The alignment processing is carried out by inserting a blank corresponding to the length of the character sequence between the positions of the subsequences.
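One way to picture the alignment processing is the sketch below, which pads the shorter of the two unmatched stretches between consecutive matched positions so that matched characters fall in the same column. The name `align`, the use of "-" as the blank, and the left-justified placement of unmatched characters are assumptions; the actual layout of FIG. 12 may differ.

```python
def align(seq1: str, seq2: str, pairs, blank: str = "-"):
    """Align two sequences on matched 1-based positions (r, j), padding with blanks."""
    row1, row2 = [], []
    prev_r = prev_j = 0
    # A sentinel pair past both ends flushes the trailing unmatched stretches.
    for r, j in list(pairs) + [(len(seq1) + 1, len(seq2) + 1)]:
        seg1 = seq1[prev_r:r - 1]  # unmatched stretch in sequence I
        seg2 = seq2[prev_j:j - 1]  # unmatched stretch in sequence II
        width = max(len(seg1), len(seg2))
        row1.append(seg1.ljust(width, blank))
        row2.append(seg2.ljust(width, blank))
        if r <= len(seq1):         # skip the sentinel
            row1.append(seq1[r - 1])
            row2.append(seq2[j - 1])
        prev_r, prev_j = r, j
    return "".join(row1), "".join(row2)


top, bottom = align("ABCBDAB", "BDCABA", [(2, 1), (3, 3), (4, 5), (6, 6)])
print(top)     # AB-C-BDAB
print(bottom)  # -BDCAB-A-
```

The chain used here is the one for the LCS “BCBA” from the example of FIG. 7; the four matched characters end up in columns 2, 4, 6, and 8 of both rows.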

Analysis of Three-Dimensional Structures of Molecules I

[0150] A method of partially relating elements, each including an atom or an atomic group, in three-dimensional structures of molecules, particularly protein molecules, and comparing them with each other will be described.

[0151] For instance, it is assumed that there are substances expressed by a point set A={a₁, a₂, . . . , a_(i), . . . , a_(m)} as shown in FIG. 13A and a point set B={b₁, b₂, . . . , b_(j), . . . , b_(n)} as shown in FIG. 13B. The elements constituting these substances A and B are related to each other as shown in FIG. 13C, and the substance B is rotated and moved so that the r.m.s.d. value between the corresponding elements is minimized, as shown in FIG. 13D. The r.m.s.d. value is obtained by the following equation, wherein U denotes a rotation matrix and w_(k) denotes the respective weights:

$$\mathrm{r.m.s.d.} = \frac{\left(\sum\limits_{k=1}^{n} w_{k}\left(U b_{k} - a_{k}\right)^{2}\right)^{\frac{1}{2}}}{n}$$

[0152] A technique of obtaining the rotation and movement of the substances which minimizes the r.m.s.d. value between these corresponding points was proposed by Kabsch et al. as described above, and is presently widely used.
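The evaluation of the r.m.s.d. value for a given rotation matrix can be sketched as follows. The sketch only evaluates the equation as printed above (square root of the weighted sum, divided by n); it does not perform the minimization over rotations and movements, and the function name `rmsd` and the default unit weights are assumptions.

```python
import math


def rmsd(U, a_points, b_points, weights=None):
    """Evaluate the r.m.s.d. equation for rotation matrix U (3x3 nested lists).

    a_points[k] and b_points[k] are corresponding points (x, y, z).
    """
    n = len(a_points)
    if weights is None:
        weights = [1.0] * n  # assumed default: all weights equal to one
    total = 0.0
    for a, b, w in zip(a_points, b_points, weights):
        # Rotate b_k by U, then accumulate the weighted squared distance to a_k.
        rotated = [sum(U[row][col] * b[col] for col in range(3)) for row in range(3)]
        total += w * sum((rotated[d] - a[d]) ** 2 for d in range(3))
    return math.sqrt(total) / n


A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
quarter_turn = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]  # maps (0, 1, 0) onto (1, 0, 0)
print(rmsd(identity, A, B))      # sqrt(2) / 2, about 0.7071
print(rmsd(quarter_turn, A, B))  # 0.0
```

In this toy case the quarter turn about the z axis superposes B on A exactly, so it is the rotation a minimization procedure would be expected to find.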

[0153] 1. Various Methods of Determining Correspondence

[0154] (1) Generation of correspondence of point sets that are not ordered

[0155] The substances A and B are expressed, respectively, by the point set A={a₁, a₂, . . . , a_(i), . . . , a_(m)}, 1≦i≦m, and the point set B={b₁, b₂, . . . , b_(j), . . . , b_(n)}, 1≦j≦n. The respective points a_(i)=(x_(i), y_(i), z_(i)) and b_(j)=(x_(j), y_(j), z_(j)) are expressed as three-dimensional coordinates. In this case, the correspondence of elements between these point sets is in principle obtained by sequentially relating the points in the respective sets, and all combinations can be generated by creating a tree structure as shown in FIG. 14A.

[0156] FIG. 14B shows an example of correspondence in the case where a point set A includes three elements and a point set B includes four elements, i.e., the correspondence between the point set A={a₁, a₂, a₃} and the point set B={b₁, b₂, b₃, b₄}. A dotted line represents generated candidates, and a solid line represents an optimum correspondence (a₁ and b₂, a₂ and b₃, a₃ and b₄) among all the generated candidates.

[0157] In this figure, nil corresponds to a case where no corresponding point exists. By using the nil, an optimum correspondence can be generated even in the case where the number of elements of one set differs from that of the other. An optimum correspondence can be generated by applying Kabsch's method to the combinations thus generated, and selecting the combination whose root mean square distance value (r.m.s.d. value) is smallest.

[0158] However, using this technique, it is generally impossible to carry out the calculation since, for example, n^(m) combinations are generated. Specifically, in the case of the point set A (m points) and the point set B (n points), which are not ordered, if i is assumed to be the number of nil, the number of generated combinations is expressed as follows:

$$\sum\limits_{i=0}^{m}\left({}_{n}P_{m-i} \times {}_{m}C_{i}\right) = \sum\limits_{i=0}^{m}\frac{n!}{\left(n-m+i\right)!} \times \frac{m!}{i!\left(m-i\right)!}$$

[0159] Here, if it is assumed that n=4, m=3, the above equation is expressed as follows.

$$\begin{aligned}\sum\limits_{i=0}^{3}\left({}_{4}P_{3-i} \times {}_{3}C_{i}\right) &= \sum\limits_{i=0}^{3}\frac{4!}{\left(4-3+i\right)!} \times \frac{3!}{i!\left(3-i\right)!} \\ &= \frac{4!}{1!} \times \frac{3!}{0!\,3!} + \frac{4!}{2!} \times \frac{3!}{1!\,2!} + \frac{4!}{3!} \times \frac{3!}{2!\,1!} + \frac{4!}{4!} \times \frac{3!}{3!\,0!} \\ &= 24 + 36 + 12 + 1 = 73\end{aligned}$$

[0160] In other words, 73 combinations are generated in the case of the point set A (3 points) and the point set B (4 points) shown in FIG. 14B. In reality, a huge number of combinations are generated since the number of points (elements) is usually far greater than these.
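The count for point sets that are not ordered can be checked numerically with a short sketch. The function name `count_unordered` is hypothetical; `math.perm` and `math.comb` are standard-library functions available from Python 3.8.

```python
import math


def count_unordered(m: int, n: int) -> int:
    """Number of correspondences between m and n unordered points, nil allowed.

    The index i is the number of elements of A that are related to nil.
    """
    return sum(math.perm(n, m - i) * math.comb(m, i) for i in range(m + 1))


print(count_unordered(3, 4))    # 73
print(count_unordered(10, 10))  # grows very quickly with the number of points
```

For m=3 and n=4 the sum reproduces the four terms 24, 36, 12, and 1 of the worked equation.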

[0161] Accordingly, in generating correspondence between these sets, the method is designed to generate an optimum combination in view of the geometric relationship within the respective sets, the threshold value condition, and the attribute of points described in detail in (4), (5), (6) below.

[0162] FIG. 15 shows an example of an algorithm for generating correspondence between the point sets A and B including elements, namely points, that are not ordered.

[0163] The elements a_(i) are taken individually from the point set A, and combined with elements b_(j) that are not yet included in ancestors or siblings in the tree structure. Then, it is determined whether this combination satisfies a restriction condition to be described later. If the combination satisfies the restriction condition, it is registered in the tree structure and the next element is related.
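The tree generation for point sets that are not ordered might be sketched as a depth-first enumeration. This is not the algorithm of FIG. 15 itself: the names `correspondences` and `accept` are hypothetical, `None` stands for nil, points are represented by zero-based indices, and the restriction condition is left as a caller-supplied predicate.

```python
def correspondences(m, n, accept=lambda partial, i, j: True):
    """Yield tuples c of length m, where c[i] is the index of the element of B
    related to a_i, or None for nil. Elements of B used by ancestors are skipped."""

    def extend(partial, used):
        i = len(partial)
        if i == m:
            yield tuple(partial)
            return
        for j in list(range(n)) + [None]:
            if j is not None and j in used:
                continue  # already related higher up in the tree
            if not accept(partial, i, j):
                continue  # restriction condition fails: prune this branch
            next_used = used | {j} if j is not None else used
            yield from extend(partial + [j], next_used)

    yield from extend([], frozenset())


everything = list(correspondences(3, 4))
print(len(everything))  # 73

# Example restriction: no nil allowed.
no_nil = list(correspondences(3, 4, accept=lambda partial, i, j: j is not None))
print(len(no_nil))      # 24
```

Because rejected branches are never extended, a restrictive predicate reduces the work as well as the number of results.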

[0164] (2) Generation of correspondence of ordered point sets

[0165] The substances A and B are expressed, respectively, by the point set A={a₁, a₂, . . . , a_(i), . . . , a_(m)}, 1≦i≦m, and the point set B={b₁, b₂, . . . , b_(j), . . . , b_(n)}, 1≦j≦n. The respective points a_(i)=(x_(i), y_(i), z_(i)) and b_(j)=(x_(j), y_(j), z_(j)) are expressed as three-dimensional coordinates. In the point set A, an order relationship is established: a₁<a₂<. . . <a_(i)<. . . <a_(m) (or a₁>a₂>. . . >a_(i)>. . . >a_(m)). Likewise, in the point set B an order relationship is established: b₁<b₂<. . . <b_(j)<. . . <b_(n) (or b₁>b₂>. . . >b_(j)>. . . >b_(n)).

[0166] In this case, elements of these point sets are in principle related to each other in accordance with the order relationship, and all combinations can be generated by creating a tree structure shown in FIG. 16A. FIG. 16B shows an example case where the point set A includes three elements and the point set B includes four elements. In other words, FIG. 16B shows the correspondence between the ordered point set A={a₁, a₂, a₃} (order relationship thereof: a₁<a₂<a₃) and the ordered point set B={b₁, b₂, b₃, b₄} (order relationship thereof: b₁<b₂<b₃<b₄).

[0167] A dotted line represents generated candidates for correspondence, and a solid line represents an optimum correspondence (a₁ and b₂, a₂ and b₃, a₃ and b₄) among the generated candidates. In this figure, nil corresponds to a case where no corresponding point exists. By using the nil, an optimum correspondence can be generated even in the case where the number of elements of one set to be related differs from that of the other to be related. An optimum correspondence can be generated by applying Kabsch's method to the combinations thus generated, and selecting the combination whose root mean square distance value (r.m.s.d. value) is smallest.

[0168] The number of generated combinations is expressed as follows in the case of the ordered point sets:

$$\sum\limits_{i=0}^{m}\left({}_{n}C_{m-i} \times {}_{m}C_{i}\right) = \sum\limits_{i=0}^{m}\frac{n!}{\left(m-i\right)!\left(n-m+i\right)!} \times \frac{m!}{i!\left(m-i\right)!}$$

[0169] Here, if it is assumed that n=4, m=3, the number of combinations is as follows.

$$\begin{aligned}\sum\limits_{i=0}^{3}\left({}_{4}C_{3-i} \times {}_{3}C_{i}\right) &= \sum\limits_{i=0}^{3}\frac{4!}{\left(3-i\right)!\left(4-3+i\right)!} \times \frac{3!}{i!\left(3-i\right)!} \\ &= \frac{4!}{3!\,1!} \times \frac{3!}{0!\,3!} + \frac{4!}{2!\,2!} \times \frac{3!}{1!\,2!} + \frac{4!}{1!\,3!} \times \frac{3!}{2!\,1!} + \frac{4!}{0!\,4!} \times \frac{3!}{3!\,0!} \\ &= 4 + 18 + 12 + 1 = 35\end{aligned}$$

[0170] In the case of the point set A (3 points) and the point set B (4 points) as shown in FIG. 16B, 35 combinations are generated.
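The count for ordered point sets can be checked in the same way. The function name `count_ordered` is hypothetical; `math.comb` is a standard-library function available from Python 3.8.

```python
import math


def count_ordered(m: int, n: int) -> int:
    """Number of order-preserving correspondences between m and n points, nil allowed.

    The index i is the number of elements of A that are related to nil.
    """
    return sum(math.comb(n, m - i) * math.comb(m, i) for i in range(m + 1))


print(count_ordered(3, 4))  # 35
```

For m=3 and n=4 the sum reproduces the four terms 4, 18, 12, and 1 of the worked equation.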

[0171] If the order relationship is applied to the respective elements within the point sets in this way, the number of generated combinations can be reduced greatly compared to (1). Further, in relating these sets, an optimum combination can be generated in view of the geometric relationship within the respective sets, the threshold value condition, and the attribute of points described in detail in (4), (5), (6) below.

[0172] FIG. 17 shows an example of an algorithm for relating elements of the ordered point sets A and B.

[0173] The elements a_(i) are taken individually from the point set A, and combined with elements b_(j) which are not yet included in ancestors or siblings in the tree structure and are larger than elements of a parent node. Then, it is determined whether this combination satisfies the restriction condition. If the combination satisfies the restriction condition, it is registered in the tree structure and the next element is related.
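For ordered point sets, the enumeration only needs to remember the largest element of B related so far. The sketch below is an illustration under assumptions, not the algorithm of FIG. 17 itself: the names `ordered_correspondences` and `accept` are hypothetical, `None` stands for nil, and points are represented by zero-based indices.

```python
def ordered_correspondences(m, n, accept=lambda partial, i, j: True):
    """Yield order-preserving tuples c of length m, where c[i] is the index of the
    element of B related to a_i, or None for nil."""

    def extend(partial, last):
        i = len(partial)
        if i == m:
            yield tuple(partial)
            return
        # Only elements of B larger than the last related one are candidates.
        for j in list(range(last + 1, n)) + [None]:
            if not accept(partial, i, j):
                continue  # restriction condition fails: prune this branch
            yield from extend(partial + [j], last if j is None else j)

    yield from extend([], -1)


candidates = list(ordered_correspondences(3, 4))
print(len(candidates))          # 35
print((1, 2, 3) in candidates)  # True: a1-b2, a2-b3, a3-b4
```

The tuple (1, 2, 3) corresponds to the solid-line correspondence of FIG. 16B when the indices are read as zero-based.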

[0174] (3) Generation of correspondence of ordered or nonordered point sets that are partially related to each other.

[0175] In the case of (1) or (2), there are cases where pairs of points that are partially related are determined in advance. In this case, while referring to information on the elements related in advance, the remaining elements of the respective point sets are sequentially related in a manner similar to the technique (1) or (2), thereby creating a tree structure as shown in FIG. 18. In this way, all combinations can be generated.

[0176] In FIG. 18, indicated at x is a portion to be pruned based on the partial correspondence. This figure shows a correspondence in the case where the element a₁ of the point set A and the element b₂ of the point set B are related to each other in advance. Similar to (1), (2), in relating these sets, an optimum combination can be generated in view of the geometric relationship within the respective sets, the threshold value condition, and the attribute of points described in detail in (4), (5), (6) below.

[0177] (4) Refining of candidates based on a geometric relationship

[0178] Since the generation of unnecessary combinations can be prevented by generating correspondence between elements of point sets in consideration of a geometric relationship, the point sets can be related efficiently.

[0179] (a) Refining of candidates based on a distance relationship

[0180] In relating the point sets, there is a distance relationship established between s (1≦s≦m−1, n−1) points close to an element a_(i) within the point set A: |a_(i)−a_(i−s)|, and another distance relationship established between s points close to an element b_(j) within the point set B: |b_(j)−b_(j−s)|. The number of candidates to be related can be reduced by selecting and relating points that satisfy the relationship ||a_(i)−a_(i−s)|−|b_(j)−b_(j−s)||≦Δd, wherein Δd denotes a permissible error range.

[0181] FIGS. 19A and 19B show an example using the geometric relationship in the case where the point b_(j) of the point set B corresponding to the element a_(i) of the point set A is selected. Each numerical value in these figures shows a distance.

[0182] As shown in FIG. 19A, there is assumed to be a distance relationship established between the element a_(i) of the point set A and two (s=2) points a_(i−1), a_(i−2) close to it: |a_(i)−a_(i−1)|=2.0, |a_(i)−a_(i−2)|=3.0. As shown in FIG. 19B, among the elements b_(p), b_(q), b_(r) of the point set B, a point is selected such that its distance relationship with the two elements close to it lies within the permissible error range Δd=0.5, and this point is related. In this example, since the point b_(p) (|b_(p)−b_(j−1)|=2.2, |b_(p)−b_(j−2)|=3.3) is found to satisfy the distance relationship as a result of comparing the distances between the points as a geometric relationship, the point b_(p) is selected as a candidate for b_(j).
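The distance test of this example can be sketched as a simple filter. The function name `within_distance_tolerance` is hypothetical, and the distances given for b_q and b_r are invented for illustration, since the text states only the values for b_p.

```python
def within_distance_tolerance(dists_a, dists_b, delta_d):
    """True if every pair of corresponding distances differs by at most delta_d."""
    return all(abs(da - db) <= delta_d for da, db in zip(dists_a, dists_b))


# Distances from a_i to a_(i-1) and a_(i-2), as in FIG. 19A.
reference = [2.0, 3.0]

# Distances from each candidate to b_(j-1) and b_(j-2).
candidates = {
    "b_p": [2.2, 3.3],  # values stated in the text
    "b_q": [3.1, 3.2],  # hypothetical values
    "b_r": [2.1, 4.4],  # hypothetical values
}

selected = [name for name, dists in candidates.items()
            if within_distance_tolerance(reference, dists, delta_d=0.5)]
print(selected)  # ['b_p']
```

For b_p the two differences are 0.2 and 0.3, both inside the permissible error range of 0.5, so it alone survives as a candidate for b_j.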

[0183] (b) Refining of candidates based on an angle relationship

[0184] In the case where the three-dimensional structures are similar to each other, it can be considered that angles defined by the respective points constituting the three-dimensional structures are also similar. In a three-dimensional structure, there exist an angle θ defined by three points and an angle φ defined between planes formed by three among four points. Hereafter, a method of reducing the number of points to be related will be described, taking the angle θ defined by the three points as an example.

[0185] In relating the sets, the number of candidates for a point to be related is reduced by selecting and relating such points from the point sets A and B that an angle defined between s (2≦s<m−1, n−1) elements close to an element b_(j) of the point set B lies within a permissible error range Δθ relative to an angle defined between s points close to the element a_(i) of the point set A.

[0186] FIG. 20B shows a case where, considering angles defined by respective elements as a geometric relationship established between the elements of the point set A, the points of the point set B are related based on this consideration.

[0187] In the case where an angle defined by the element a_(i) of the point set A and two (s=2) points a_(i−1), a_(i−2) close to the element a_(i) is θ_(a), and angles defined by the elements b_(p), b_(q), b_(r) and two elements b_(j−1), b_(j−2) close to these elements b_(p), b_(q), b_(r) are θ_(p), θ_(q), θ_(r), points such that the angle difference lies within the permissible error range Δθ are selected and related. In this figure, since only the point b_(p) satisfies the relationship |θ_(a)−θ_(p)|≦Δθ, the point b_(p) is selected as a candidate for b_(j).
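The angle test might be sketched as follows. Several points are assumptions made only for illustration: the text does not say which of the three points is the vertex of θ, so the middle point (a_(i−1) or b_(j−1)) is taken as the vertex here, and all coordinates as well as the tolerance of 10 degrees are invented.

```python
import math


def angle(p, vertex, q):
    """Angle in degrees at `vertex` between the directions to p and to q."""
    v1 = [p[d] - vertex[d] for d in range(3)]
    v2 = [q[d] - vertex[d] for d in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    # Clamp to [-1, 1] to guard against rounding just outside the domain of acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


# Hypothetical coordinates: a_(i-2), a_(i-1), a_i.
theta_a = angle((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))

# Hypothetical b_(j-2), b_(j-1) and three candidates for b_j.
b_prev2, b_prev1 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
candidates = {
    "b_p": (1.0, 1.0, 0.1),
    "b_q": (2.0, 1.0, 0.0),
    "b_r": (0.5, 0.2, 0.0),
}

delta_theta = 10.0  # permissible error range in degrees (assumed)
selected = [name for name, point in candidates.items()
            if abs(theta_a - angle(b_prev2, b_prev1, point)) <= delta_theta]
print(selected)  # ['b_p']
```

With these invented coordinates θ_a is 90 degrees, and only b_p gives an angle within the tolerance, mirroring the outcome described for FIG. 20B.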

[0188] (c) Refining of candidates based on distances and angles from a center of gravity.

[0189] If the three-dimensional structures are similar to each other, it can be considered that distances and angles from a center of gravity are similar. Accordingly, the number of candidates for a point to be related can be reduced by calculating the center of gravity from the selected points, and comparing the distances and angles using a technique similar to (a) and (b).

[0190] (5) Refining of candidates based on a threshold value condition

[0191] The point sets can be more efficiently related by setting a specified threshold value in the aforementioned methods (1) to (4), and pruning a retrieval path if an attribute value of a candidate is greater than this threshold value. As this threshold value, for example, a restriction on the nil number (the number of nil) and a restriction on the r.m.s.d. value can be used.

[0192] (a) Restriction in a nil number

[0193] When the total number of nil becomes too large among the generated combinations, meaningless candidates for combinations are generated as a result. Accordingly, in relating the elements of the point sets A and B, if the total number of nil exceeds a given threshold value, the generation of unnecessary candidates can be prevented by excluding such combinations from the candidates, thereby relating the elements more efficiently.

[0194] FIG. 21 shows an example of pruning in a case where the total number of nil is restricted to 0 in relating a point set A={a₁, a₂, a₃} to a point set B={b₁, b₂, b₃, b₄}. In this figure, a portion designated at x in the tree structure is a portion to be pruned.

[0195] (b) Restriction in an r.m.s.d. value

[0196] In the case where the r.m.s.d. value of all the points related thus far becomes exceedingly large as a result of relating an element a_(i) of a point set A to an element b_(j) of a point set B, it is preferable to exclude this point from the candidates. In view of this, the r.m.s.d. value of all the points when the element a_(i) is related to the element b_(j) is calculated, and this point is selected as a candidate if the calculated r.m.s.d. value is not greater than a given threshold value. Conversely, this point is excluded from the candidates if the r.m.s.d. value exceeds the given threshold value. In this way, the candidates for a point to be related can be generated more efficiently.

[0197] (6) Refining of candidates based on an attribute of a point

[0198] The number of candidates for a point to be related can be reduced by using an attribute of the point in relating an element a_(i) of a point set A to an element b_(j) of a point set B. The attributes of the point, for example, include the type of an atom, an atomic group, and a molecule, the hydrophilic property, the hydrophobic property, and the positive or negative charge. It is determined whether the point is selected as a candidate by checking whether these attributes coincide.

[0199] For example, in the case of relating elements constituting proteins, the number of candidates for a point to be related can be reduced by using the type of an amino acid residue (corresponding to an atomic group) as an attribute of the point. Regarding the types of amino acid residues or the like, please refer to references such as “Fundamental to Biochemistry,” pp. 21-26, Tokyo Kagaku Dohjin Shuppan.
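As an illustrative sketch in Python, the attribute check may be written as a simple filter. The representation of a point as a (residue type, coordinates) tuple and the function name are hypothetical.

```python
def filter_by_attribute(a_i, set_b, attribute):
    """Return the elements of set_b whose attribute coincides with that of a_i."""
    wanted = attribute(a_i)
    return [b_j for b_j in set_b if attribute(b_j) == wanted]

# Hypothetical points: (residue type, (x, y, z)).
a_i = ("GLY", (0.0, 0.0, 0.0))
set_b = [("ALA", (1.0, 0.0, 0.0)), ("GLY", (2.0, 1.0, 0.0)), ("GLY", (5.0, 5.0, 5.0))]
candidates = filter_by_attribute(a_i, set_b, attribute=lambda p: p[0])
```

Only the two glycine points remain as candidates for a_(i), so the subsequent geometric tests operate on a smaller set.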

[0200] Further, the candidates for the point to be related can be reduced by adding a restriction to a specific element. For example, the candidates to be retrieved can be reduced by providing the restriction that nil is not inserted at a certain point or by designating an attribute of a point for a certain point.

[0201] 2. Adaptation Examples.

[0202] Described below are adaptation examples where the subject is a protein as a three-dimensional structure of a substance. Here, however, there is no particular limitation except that the subject basically has a three-dimensional structure, and the invention can be adapted even to those having general molecular structures by relying upon the same method.

[0203] (1) Device for displaying the superposition of molecular structures.

[0204] In examining properties of a substance, the molecules are superposed one upon another, and a common or specific portion is discriminated so as to analyze or predict properties of the substances. Since such operations have been effected manually, a device that automatically displays the molecular structures in an overlapped manner is preferred.

[0205] FIG. 22 is a diagram of the system constitution of a device that displays the molecular structures in an overlapped manner according to the present invention. This device is constituted by a data base 80 in which are registered data related to the three-dimensional structures of substances, a data input unit 82 that reads the registered data and an input command from a user, a superposition calculation unit 84 that superposes the three-dimensional structures (three-dimensional coordinates) of the substances read from the data base 80, based on the method of superposition discussed above in Section 1 on page 28 of this application entitled “Various Methods of Determining Correspondence”, such that the r.m.s.d. values become the smallest, and a graphic display unit 86 that displays the three-dimensional structures in an overlapped manner based on the calculated results.

[0206] (a) Data base 80.

[0207] The data base 80 stores the data related to three-dimensional structures of substances, i.e., stores the names of substances, three-dimensional coordinates of atoms constituting the substances, etc.

[0208] (b) Data input unit 82.

[0209] The data input unit 82 reads from the data base the data (three-dimensional coordinates) of substances that are to be superposed based on an input command of a user, and sends the data to the superposition calculation unit 84.

[0210] (c) Superposition calculation unit 84.

[0211] The superposition calculation unit 84 determines correspondence among the elements that constitute the substances in order to superpose three-dimensional structures (three-dimensional coordinates) of substances according to the method of superposition discussed in Section 1, entitled “Various Methods of Determining Correspondence”, on page 28 of this application in a manner such that optimum r.m.s.d. values are obtained, and sends the results to the graphic display unit 86. In determining the correspondence, there is provided a function that finds correspondence between spatially similar portions based on the order of amino acid sequence that constitutes a protein, and a function that finds correspondence between spatially similar portions irrespective of the order of amino acid sequence. In retrieving the spatially similar portions based on the order of amino acid sequence, amino acids constituting the protein can be grasped as an ordered set whose elements are ordered according to the numbers of amino acid sequence, and therefore similar portions can be calculated based on the methods discussed in Section 1, subsections (2), (3), (4), (5), and (6) on pages 30, 32, 33, 35, and 36, respectively, of this application. By grasping the amino acids simply as a nonordered set, furthermore, it is possible to calculate spatially similar portions irrespective of the order of amino acid sequence relying upon the systems mentioned in Section 1, subsections (1), (3), (4), (5) and (6) on pages 30, 32, 33, 35, and 36, respectively, of this application.

[0212] (d) Graphic display unit 86.

[0213] The graphic display unit 86 displays the three-dimensional structures of substances in a superposed manner based on the results calculated by the superposition calculation unit 84. By looking at the displayed result in a 3D graphic while manually rotating it, it can be understood which portions are superposed and how they are superposed.

[0214] FIG. 23A shows an amino acid sequence of calmodulin, which is a protein, and FIG. 23B shows an amino acid sequence of troponin C. FIGS. 23A and 23B show in excerpts the amino acid sequences registered to the PDB. The amino acid sequence shown in FIG. 23A lacks amino acids that correspond to amino acid sequence Nos. 1-4 and 148 included in the ordinary amino acid sequence and, hence, the numbers are shifted. Hereinafter, these diagramed amino acid sequence numbers will be used. As shown in FIG. 24A, it is known from results of biochemical experiments that calmodulin can bind four Ca²⁺ as indicated by black circles. Also, it is known that troponin C can bind two Ca²⁺ as indicated by black circles in FIG. 24B. It is known that calmodulin has four places (sites) to bind Ca²⁺ in its amino acid sequence and that, among these, the amino acids of sequence numbers 81-108 and 117-143 form skeletons similar to those of the two sites that bind Ca²⁺ in troponin C. A protein is constituted by amino acids and it is known that its skeleton can be represented by the coordinates of atoms (Cα) that constitute the amino acids. FIG. 25 shows the results obtained when a spatially similar portion (a single site) is searched for based on the order of amino acid sequence using the Ca²⁺ binding site 81-108 of calmodulin as a probe. FIG. 25 indicates that the amino acid sequence numbers 96-123 in troponin C correspond to the Ca²⁺ binding site 81-108 in calmodulin. These results are in agreement with the results of biochemical experiments. FIG. 26 shows the results obtained when spatially similar portions (a plurality of sites) are searched for based on the order of amino acid sequence using the Ca²⁺ binding sites 81-108 and 117-143 in calmodulin as probes. FIG. 26 indicates that the amino acid sequence numbers 96-123 and 132-158 in troponin C correspond to the Ca²⁺ binding sites 81-108 and 117-143 in calmodulin. These results, too, are in agreement with the results of biochemical experiments.

By using the apparatus of the present invention as described above, correspondence among the constituent elements of substances can be calculated in a manner such that the r.m.s.d. values are minimized in the three-dimensional structures of the substances. By displaying the corresponding portions in a superposed manner, therefore, it becomes possible to display the substances in a superposed manner in an optimum condition.

[0215] (2) Three-dimensional structure retrieval device and function data base generating device

[0216] It is essential to clarify a correlation between the function and the structure of a substance in order to develop a substance having a new function such as a new medicine or to improve the function of a substance that already exists. To promote the aforesaid work, it becomes necessary to make references to many substances having similar three-dimensional structures. This necessitates a three-dimensional structure retrieving device that is capable of easily taking out the substances having similar three-dimensional structures from the data base. Moreover, a device of this kind makes it possible to prepare a function data base in which are collected three-dimensional structures that are related to the functions. The function data base will be described later in (3). FIG. 27 is a diagram illustrating the system constitution of a three-dimensional structure retrieving device that is constituted by a data base 80 that stores three-dimensional structures of substances, a data input unit 82 that reads the data registered to the data base 80 and an input command of a user, a similarity calculation unit 88 that retrieves structures which are similar to three-dimensional structures (three-dimensional coordinates) of substances read from the data base 80 and which minimize the r.m.s.d. value, based on the method of superposition mentioned in Section 1, and a retrieved result display unit 90 that displays the retrieved results. FIG. 28 is a diagram showing the system constitution of a device that generates a function data base.

[0217] (a) Data base 80.

[0218] The data base 80 stores the data related to three-dimensional structures of substances, i.e., stores the names of substances, the three-dimensional coordinates of atoms constituting the substances, etc.

[0219] (b) Data input unit 82.

[0220] The data input unit 82 reads the data of three-dimensional structures that serve as keys for retrieval and the data of three-dimensional structures registered to the data base 80 that will be referred to during the retrieval based on the input command from the user, and sends the data to the similarity calculation unit 88.

[0221] (c) Similarity calculation unit 88.

[0222] The similarity calculation unit 88 calculates optimum superposition of three-dimensional structures. At this moment, there are provided a function for retrieving spatially similar portions based on the order of amino acid sequence that constitutes a protein, and a function for retrieving spatially similar portions irrespective of the order of amino acid sequence. In retrieving the spatially similar portions based on the order of amino acid sequence, amino acids constituting the protein can be grasped as an ordered set whose elements are ordered according to the numbers of amino acid sequence, and therefore similar portions can be calculated based on the methods described in section 1, subsections (2), (3), (4), (5), and (6) on pages 30, 32, 33, 35, and 36, respectively, of this application. By grasping the amino acids simply as a nonordered set, furthermore, it is possible to calculate spatially similar portions irrespective of the order of amino acid sequence relying upon the systems mentioned in section 1, subsections (1), (2), (3), (4), (5) and (6), on pages 28, 32, 33, 35 and 36 respectively, of this application.

[0223] (d) Retrieved result display unit 90.

[0224] The retrieved result display unit 90 expresses similar portions as amino acid sequence names and amino acid numbers based on the results of the similarity calculation unit 88, and displays r.m.s.d. values as a scale of similarity.

[0225] FIG. 29 shows the results obtained when similar three-dimensional structures are retrieved from the PDB using, as probes, coordinates of Cα corresponding to the amino acid residue Nos. 7 to 14 in the protein elongation factor, a binding site for phosphoric acid of GTP (guanosine triphosphate). Retrieval is carried out over 744 three-dimensional structures of protein among 824 data registered to the PDB. FIG. 29 shows amino acid residue numbers of a target protein that is retrieved, an amino acid residue sequence, an amino acid residue sequence of a probe, and r.m.s.d. values between target and probe three-dimensional structures. As a result, eight three-dimensional structures are retrieved (including the probe itself). If classified depending upon the kinds of proteins, there are retrieved three adenylate kinases, two elongation factors (one of which is the probe itself) and three ras proteins, all of which are sites where phosphoric acid of ATP or GTP is bound. Thus, the function of sites binding phosphoric acid of ATP or GTP has a very intimate relationship to their three-dimensional structures, and their structures are very specific because they never incidentally coincide with other structures that are not phosphoric acid binding sites. In FIG. 30, the retrieved results are partly shown by their three-dimensional structures.

[0226] By using this device as described above, it is possible to retrieve similar structures from the data base in which are stored three-dimensional structures of substances by designating the three-dimensional structure of a substance that serves as a probe.

[0227] (3) Function predicting device.

[0228] As implied by the results shown in FIG. 29, it is considered that a protein has a three-dimensional structure that specifically develops its function. Therefore, if a data base (hereinafter referred to as function data base) of three-dimensional structures specific to the function is provided for each of the functions, then it becomes possible to predict what function is exhibited by a substance, and by which portion (hereinafter referred to as function site) of the three-dimensional structure the function is controlled, by examining whether the structures registered to the function data bases exist within the three-dimensional structure of a substance newly determined by X-ray crystal analysis or NMR. FIG. 31 illustrates the function predicting device which is constituted by a data input unit 82 that receives as inputs the three-dimensional structures of substances, a function data base 92 to which are registered the three-dimensional structures that are related to functions, a function prediction unit 94 that performs optimum superposition of the three-dimensional structure read from the function data base 92 and the three-dimensional structure of a substance that is an input, based on the method of retrieving the three-dimensional structure described in section 1 on page 28, in order to determine whether the three-dimensional structure includes a structure related to the function, and specifies the function sites, and a predicted result display unit 96 that displays the predicted results.

[0229] (a) Data input unit 82.

[0230] The data input unit 82 reads the data of three-dimensional structures constituting substances and sends them to the function prediction unit 94.

(b) Function data base 92.

[0231] The function data base 92 stores the functions of substances and data related to three-dimensional structures specific to the functions. The data base stores the names of functions, and three-dimensional coordinates of atoms constituting three-dimensional structures specific to the functions, etc. The function data base 92 is formed by a function data base generating device (FIG. 28) that is constituted similarly to the three-dimensional structure retrieving device described in (2) above.

[0232] (c) Function prediction unit 94.

[0233] The function prediction unit 94 calculates the optimum superposition of three-dimensional structures registered to the function data base 92 and three-dimensional structures that are input. At this moment, there are provided a function for retrieving spatially similar portions based on the order of amino acid sequence that constitutes a protein, and a function for retrieving spatially similar portions irrespective of the order of amino acid sequence. In retrieving the spatially similar portions based on the order of amino acid sequence, amino acids constituting the protein can be grasped as an ordered set whose elements are ordered according to the numbers of amino acid sequence, and therefore similar portions can be calculated based on the methods described in section 1, subsections (2), (3), (4), (5) and (6) on pages 30, 32, 33, 35, and 36, respectively, of this application. By grasping the amino acids simply as a nonordered set, furthermore, it is possible to calculate spatially similar portions irrespective of the order of amino acid sequence relying upon the systems mentioned in section 1, subsections (3), (4), (5) and (6) on pages 30, 32, 33, 35, and 36, respectively, of this application.

[0234] (d) Predicted result display unit 96.

[0235] The predicted result display unit 96 expresses the names of functions, names of amino acid sequences at function sites and amino acid residue numbers registered to the function data base relying on the results of the function prediction unit 94, and displays r.m.s.d. values as a scale of similarity.

Analysis of Three-dimensional Structures of Molecules II

[0236] In the aforementioned method of imparting correspondence, similar structures were successfully picked up by refining the candidates by taking into consideration such threshold conditions as geometrical relations such as distances among the elements in a point set, r.m.s.d. values and the number of nils, as well as attributes of constituent elements (kinds of amino acids in the case of a protein), and by finding optimum combinations. Still, extended periods of time are often required for the calculation, depending on the shape of the three-dimensional structure, the number of elements that constitute a point set, the geometrical limitations and the threshold values. Therefore, the calculation must be carried out at higher speeds. It is, however, difficult to establish a method that is capable of executing the processing at high speed under any condition.

[0237] As shown in FIGS. 32A and 32B, therefore, the three-dimensional structures (partial structures) of molecules are divided into those having linear structures and those having non-linear structures. Among them, those having linear structures are processed at a higher speed using a method described below.

[0238] Referring to FIG. 32A, the structure in which two points at both ends of a three-dimensional structure are most distant from each other is called a linear structure. Referring to FIG. 32B, on the other hand, the structure in which two points at both ends are not most distant from each other is called a non-linear structure.
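The definition above can be checked directly, as in the following minimal Python sketch. It assumes an ordered list of (x, y, z) points whose first and last entries are the two ends; the function name is hypothetical.

```python
import math
from itertools import combinations

def is_linear(points):
    """Return True if no pair of points is farther apart than the two
    end points, i.e. the structure is linear in the sense of FIG. 32A."""
    end_to_end = math.dist(points[0], points[-1])
    return all(math.dist(p, q) <= end_to_end for p, q in combinations(points, 2))
```

A hairpin-like chain whose ends lie close together fails the test and would be treated as a non-linear structure.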

[0239] In accomplishing correspondence among the elements between point sets A and B that form three-dimensional structures, according to this embodiment, the point set B is first divided depending upon the spatial size or the number of constituent elements of the point set A in order to find subsets of points that are candidates for the corresponding points, and the optimum correspondence is then efficiently searched for with respect to each of the subsets. Described below is a method of finding the subsets.

[0240] (1) Division of an ordered point set B according to the number of constituent elements of a point set A.

[0241] FIG. 33 is a diagram explaining how to divide a point set B according to the number of constituent elements of a point set A.

[0242] The size of search space is decided according to the number m of elements of the point set A, and the point set B is divided according to the size in order to reduce the space to be searched, thereby shortening the time for calculation. In an example of FIG. 33, a size 10, which is twice as great as the number 5 of elements of the point set A, is set to be the size of a space to be searched, in order to effect the processing.

[0243] FIG. 34 shows a division algorithm for the point set B.

[0244] Ordered point sets are given as A=[a₁, - - - , a_(m)], B=[b₁, - - - , b_(i), - - - , b_(j), - - - , b_(n)], and the following processing is effected for the subset B′ of the point set B.

[0245] Process 1:

[0246] Find the number m of elements of the point set A.

[0247] Process 2:

[0248] Set the size (f(m)) of B′ in compliance with a function f(x) that defines the size of the point set B′.

[0249] Process 3:

[0250] Divide the point set B to obtain the following subset B′.

[0251] (a) j=i+f(m)−1

[0252] (b) Point set B′=[b_(i), b_(i+1), - - - , b_(j−1), b_(j)]

[0253] Process 4:

[0254] The head points a₁ and b_(i) are related to each other and then the remaining elements of the point sets A, B′ are related to each other according to the method explained with reference to FIGS. 17 to 21, in order to find correspondence that meets a predetermined limiting condition.

[0255] Process 5:

[0256] When b_(j) is a final element of the point set B, the program is finished.

[0257] When b_(j) is not the final element of the point set B, obtain i=i+1 and return to process 3.
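Processes 1 to 5 above amount to sliding a window of f(m) elements along the point set B. A minimal Python sketch is given below; it uses 0-based indices, takes f(m)=2m as in the example of FIG. 33, and clips the window when B is shorter than f(m). The clipping, the function name and the default f are assumptions, and the matching of process 4 is left to the caller.

```python
def windows_by_count(b, m, f=lambda m: 2 * m):
    """Yield (i, B') for each subset B' = [b_i, ..., b_j] with
    j = i + f(m) - 1, advancing i by one until b_j is the final element."""
    n = len(b)
    size = f(m)                      # process 2: size of B'
    i = 0
    while True:
        j = min(i + size - 1, n - 1)  # process 3(a), clipped to the end of B
        yield i, b[i:j + 1]           # process 3(b); process 4 is done by the caller
        if j == n - 1:                # process 5: b_j is the final element
            break
        i += 1
```

For a point set B of 12 elements and m=5, three windows of 10 elements each are produced.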

[0258] (2) Division of an ordered point set B according to the spatial size of the point set A.

[0259] As shown in FIG. 35A, a distance d is found across the two points at both ends of the point set A, and the point set B is divided by the distance d as shown in FIG. 35B in order to reduce the search space, thereby shortening the time for calculation. According to this method, however, since the correspondence of a head element of the set is not fixed as mentioned with reference to the process 4 of (1), there exists a probability that the same solution may be calculated many times. Therefore, prior to advancing to the next search space, the next search space is set by taking into consideration the position of a solution obtained in the previous search space, so that the search spaces will not be overlapped and the same solution will not be calculated many times.

[0260] FIG. 36 is a diagram showing a division algorithm for the ordered point set B depending upon the spatial size of the point set A.

[0261] The ordered point sets are given as A=[a₁, - - - , a_(m)], B=[b₁, - - - , b_(i), - - - , b_(j), - - - , b_(n)], and the following process is effected for the subset B′ of the point set B.

[0262] Process 1:

[0263] Distances among points of the point sets A and B are calculated to prepare a distance table (not shown).

[0264] Process 2:

[0265] A distance between a first point and a final point (a₁, a_(m)) in the point set A is found from the distance table, and is denoted as d.

[0266] Process 3:

[0267] Divide the point set B.

[0268] (a) Find from the distance table the one having a maximum j from among b_(j) that have a distance of d±α from b_(i) (i=1, in initial state) and that satisfy m≦j−i≦2m.

[0269] (b) Obtain a point set B′=[b_(i), b_(i+1), - - - , b_(j−1), b_(j)].

[0270] Process 4:

[0271] Accomplish correspondence among the elements of point sets A, B′ according to the method explained with reference to FIGS. 17 to 21, in order to find correspondence that meets a predetermined limiting condition.

[0272] Process 5:

[0273] When b_(i) is a final element of the point set B, the program is finished.

[0274] Process 6:

[0275] When b_(i) is not the final element of the point set B:

[0276] i) Obtain i=k+1 and return to the process 3 when a solution that satisfies the predetermined limiting condition is obtained between the point sets A and B′, where the point corresponding to a₁ is b_(k); or

[0277] ii) obtain i=i+1 and return to the process 3 when a solution is not obtained between the point sets A and B′.
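The control flow of processes 2 to 6 may be sketched in Python as follows, using 0-based indices and computing distances on the fly instead of from a distance table. The callback `match` stands in for the matching of process 4 and is hypothetical: it is assumed to return the index k (with i≦k≦j) of the point of B related to a₁, or `None` when no solution is found. Treating a missing window as "no solution" is likewise an assumption.

```python
import math

def find_window_end(b, i, d, alpha, m):
    """Process 3(a): largest j with |dist(b_i, b_j) - d| <= alpha and
    m <= j - i <= 2m, or None if no such j exists."""
    best = None
    for j in range(i + m, min(i + 2 * m, len(b) - 1) + 1):
        if abs(math.dist(b[i], b[j]) - d) <= alpha:
            best = j
    return best

def search_by_spatial_size(a, b, alpha, match):
    """Return a list of (i, j, k) for every window in which match() succeeds."""
    m = len(a)
    d = math.dist(a[0], a[-1])            # process 2
    solutions = []
    i = 0
    while i < len(b) - 1:                 # process 5: stop at the final element
        j = find_window_end(b, i, d, alpha, m)
        k = match(a, b, i, j) if j is not None else None   # process 4
        if k is not None:
            solutions.append((i, j, k))
            i = k + 1                     # process 6 i)
        else:
            i += 1                        # process 6 ii)
    return solutions
```

Restarting at k+1 rather than i+1 after a hit is what keeps the same solution from being found again in an overlapping window.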

[0278] (3) Other method of dividing the ordered point set B according to the spatial size of the point set A.

[0279] As shown by an algorithm of FIG. 37, it is possible to divide the ordered point set B depending on the spatial size of the point set A. Even in this case, a distance is found across two points at both ends of the point set A, and the point set B is divided by this distance to reduce the search space and to shorten the time for calculation. Moreover, at the time of advancing to the next search space, the next search space is set by taking into consideration the number of elements of the point set A that serve as search keys, so that the search spaces will not be overlapped and the same solution will not be calculated many times.

[0280] The ordered point sets are given as A=[a₁, - - - , a_(m)], B=[b₁, - - - , b_(i), - - - , b_(j), - - - , b_(n)], and the following process is effected for the subset B′ of the point set B.

[0281] Process 1:

[0282] Distances among points of the point sets A and B are calculated to prepare a distance table (not shown).

[0283] Process 2:

[0284] A distance between a first point and a final point (a₁, a_(m)) in the point set A is found from the distance table, and is denoted as d.

[0285] Process 3:

[0286] Divide the point set B.

[0287] (a) Find from the distance table the one having a maximum j from among b_(j) that have a distance of d±α from b_(i) (i=1, in initial state) and that satisfy m≦j−i≦2m.

[0288] (b) Obtain a point set B′=[b_(i), b_(i+1), - - - , b_(j−1), b_(j)].

[0289] Process 4:

[0290] Accomplish correspondence among the elements of point sets A, B′ according to the method explained with reference to FIGS. 17 to 21, in order to find correspondence that meets a predetermined limiting condition.

[0291] Process 5:

[0292] When b_(i) is a final element of the point set B, the program is finished. When b_(i) is not the final element of the point set B, obtain i=j−m+1 and return to the process 3.

[0293] In determining the correspondence among the points that form three-dimensional structures, the points are related to each other after the search space of three-dimensional structures is divided. Therefore, the points can be related to one another within short periods of time. These methods can similarly be adapted to the processing devices that are described with reference to FIGS. 22, 27, 28 and 31.

[0294] FIG. 38A shows the amino acid sequence of the protein trypsin, and FIG. 38B shows the amino acid sequence of elastase. FIGS. 38A and 38B show excerpts of amino acid sequences registered to the PDB. The amino acid sequence numbers shown in FIGS. 38A and 38B are those that are simply given to the amino acids described in the PDB starting from 2 and are different from the traditional amino acid numbers. In the following description, the amino acid numbers that are diagramed will be used.

[0295] The trypsin and elastase that are shown are proteolytic enzymes of the kind called serine proteases, in which histidine, serine and aspartic acid are indispensable at the active sites. Though these enzymes have quite different substrate specificity, they are considered to be a series of enzymes from the point of view of evolution since they are similar to each other with respect to structure and catalytic mechanisms.

[0296] FIG. 39A shows the retrieved results of histidine active sites of elastase with the histidine active sites (36-41) of trypsin as probes. It will be understood that 41-46 of elastase correspond to the active sites 36-41 of trypsin. FIG. 39B shows the retrieved results of serine active sites of elastase with serine active sites (175-179) of trypsin as probes, from which it will be understood that 186-190 of elastase correspond to the active sites 175-179 of trypsin. These results are in agreement with the results obtained through biochemical experiments.

Analysis of Three-Dimensional Structures of Molecules III

[0297] Three-dimensional structures of proteins contain common basic structures such as α-helix and β-strand which are called secondary structures. Several methods have heretofore been developed to effect automatic retrieval based upon the similarity in the secondary structures without using r.m.s.d. values. According to these methods, partial structures along the amino acid sequence are denoted by symbols of secondary structures and are compared by way of symbols, but it was not possible to compare similarities of spatial position relationships of the elements that constitute partial structures or to compare similarities of spatial position relationships of partial structures.

[0298] Therefore, described below are a method in which a set of elements constituting a molecule is divided into subsets based on the secondary structures, and the subsets are related to each other based on the similarities of spatial position relationships of elements that belong to the subsets, a method of evaluating similarities of spatial position relationships of a plurality of subsets that are related to one another, and a method of analysis by utilizing such methods.

[0299] (1) Division of a point set into subsets.

[0300] The structure A and the structure B are, respectively, constituted by a point set A=[a₁, a₂, a₃, - - - , a_(i), - - - , a_(m)], where 1≦i≦m, and a point set B=[b₁, b₂, b₃, - - - , b_(j), - - - , b_(n)], where 1≦j≦n, and each point is expressed by a three-dimensional coordinate consisting of a_(i)=(x_(i), y_(i), z_(i)) and b_(j)=(x_(j), y_(j), z_(j)).

[0301] In order to facilitate determination of the correspondence among the points, the structure is divided into partial structures that are structurally meaningful, and a point set is divided into subsets. Examples of the partial structures which are structurally meaningful include functional groups and partial structures having certain functions in the case of chemical substances, and secondary structures such as helices and sheet structures and partial structures developing certain functions in the case of proteins.

[0302] The coordinates of a partial structure are found by using the known data or by the analysis of three-dimensional coordinates. The point set A divided into subsets is denoted as A=[(a₁, a₂, - - - , a_(k)), (a_(k+1), a_(k+2), - - - , a_(l)), - - - , (a_(l+1), a_(l+2), - - - , a_(m))], where 1≦k≦l≦m. Here, if SA1=(a₁, a₂, - - - , a_(k)), SA2=(a_(k+1), a_(k+2), - - - , a_(l)), - - - , SAp=(a_(l+1), a_(l+2), - - - , a_(m)), then the sets SA are subsets which constitute the point set A, and the set A is expressed by the subsets SA as A=(SA1, SA2, - - - , SAp). Similarly, the point set B is divided into subsets SB of B, and is expressed as B=(SB1, SB2, - - - , SBq).
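A minimal Python sketch of this division follows. It assumes that a secondary structure label is already available for every point (for example from known data), and splits the ordered point set into consecutive runs carrying the same label; the label values and the function name are hypothetical.

```python
from itertools import groupby

def split_into_subsets(points, labels):
    """Divide an ordered point set into subsets SA1, SA2, ... made of
    consecutive points sharing the same secondary structure label.
    Returns a list of (label, subset) tuples."""
    subsets = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        subsets.append((label, points[start:start + length]))
        start += length
    return subsets
```

For six points labelled H, H, H, E, E, H this yields three subsets: a helix of three points, a strand of two points, and a second helix of one point.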

[0303] (2) Determination of Correspondence among the subsets.

[0304] Considered below is the determination of correspondence among elements of the structure A=(SA1, SA2, - - - , SAp) and the structure B=(SB1, SB2, - - - , SBq), i.e., the determination of the correspondence among subsets. In this case, possible correspondence can be described by a tree structure created by successively giving correspondence to the elements constituting the sets. A node of the root of the tree is a starting point. A leaf node represents a result of possible setting of correspondence, and an intermediate node represents a partial result. Nil is used when there is no corresponding element.

[0305] FIG. 40 is a diagram illustrating the possible correspondence of subsets. If a status tree that corresponds to all possible combinations is created, the number of nodes becomes significantly high. Therefore, the branches must be pruned. Namely, when the nodes are added by giving correspondence between two subsets, the matching is effected between the subsets, and the nodes are added provided the result satisfies the limiting condition. The limiting condition will be described later in (4). The matching of the subsets is carried out in compliance with the method described in the “Analysis of Three-Dimensional Structures of Molecules I”.

[0306] (3) Determination of correspondence among subsets wherein partial correspondence is predetermined and/or that are ordered.

[0307] When partial correspondence between subsets is predetermined and/or when subsets are ordered in the above case (2), branches of the tree structure formed in (2) are pruned based thereon.

[0308] (4) Refining the candidates by the similarity among the subsets.

[0309] In the above methods (2) and (3), the branches are pruned based on the similarity between the two subsets that are candidates in order to determine the correspondence efficiently. The attributes possessed by the candidates and the structural similarity between the two subsets are taken into consideration. The attributes of subsets may be the kinds of functional groups and kinds of functions in the case of chemical substances, and the constituent elements in the secondary structure and the kinds of functions in the case of proteins. The structural similarity of the two subsets is judged by the three-dimensional structure matching method which accomplishes the correspondence among the elements of the two ordered point sets described in the “Analysis of Three-Dimensional Structures of Molecules I”. The r.m.s.d. among the points is calculated when an optimum matching is effected based on this method.

[0310] The candidates can be refined by generating nodes ofcorrespondence only when the two subsets that are the candidates havethe same attribute and their r.m.s.d. values are smaller than athreshold value. FIG. 41 shows an algorithm for determiningcorrespondence of subsets of the sets A and B where the above limitingcondition is taken into consideration.

[0311] In FIG. 41, a subset is taken out from the point set A and isdenoted as SA. Further, and element SB that is not included in theancestor or siblings of the tree structure is taken out from the pointset SB and is denoted as d_(j). When there is no element that can betaken out, then d_(j=nil).

[0312] Then, SA and d_(j) are examined in regard to whether theirattributes are the same or not, and when the attributes are not thesame, the combination is discarded for pruning. When the attributes arethe same, the point sets are matched, and an r.m.s.d. value iscalculated under the optimum matching. When this value is smaller than apredetermined threshold value, SA and d_(j) are related to each other,and are registered as child nodes of d_(j−1) in the tree structure, andcorrespondence of an optimum point is stored in the sequence. Theabove-mentioned processing is repeated for all of the subsets.
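The pruned tree search of paragraphs [0311] and [0312] can be sketched as a depth-first enumeration. This is a minimal illustration, not the algorithm of FIG. 41 itself: the names `Subset`, `enumerate_correspondences` and `unaligned_rmsd` are hypothetical, the ordering constraint of (3) is not handled, and the default score is a simple stand-in that pairs points by index without superposition, whereas the text uses the optimum matching of the “Analysis of Three-Dimensional Structures of Molecules I”.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Pairing = List[Tuple[str, Optional[str]]]


@dataclass(frozen=True)
class Subset:
    name: str                    # e.g. "SA1"
    attribute: str               # e.g. the secondary-structure type
    points: Tuple[Point, ...]    # coordinates of the constituent elements


def unaligned_rmsd(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Stand-in score (assumption): index-paired r.m.s.d. WITHOUT superposition."""
    pairs = list(zip(p, q))
    total = sum((a - b) ** 2 for u, v in pairs for a, b in zip(u, v))
    return math.sqrt(total / len(pairs))


def enumerate_correspondences(
    a_subsets: Sequence[Subset],
    b_subsets: Sequence[Subset],
    threshold: float,
    score: Callable[[Sequence[Point], Sequence[Point]], float] = unaligned_rmsd,
) -> List[Pairing]:
    """Return every leaf of the pruned tree as a list of (SA, SB or None) names."""
    leaves: List[Pairing] = []

    def extend(i: int, used: frozenset, partial: Pairing) -> None:
        if i == len(a_subsets):
            leaves.append(list(partial))       # leaf node: complete correspondence
            return
        sa = a_subsets[i]
        for j, sb in enumerate(b_subsets):
            if j in used:
                continue                       # already related in an ancestor node
            if sa.attribute != sb.attribute:
                continue                       # prune: attributes differ
            if score(sa.points, sb.points) >= threshold:
                continue                       # prune: r.m.s.d. not below threshold
            partial.append((sa.name, sb.name))
            extend(i + 1, used | {j}, partial)
            partial.pop()
        partial.append((sa.name, None))        # nil: no corresponding subset
        extend(i + 1, used, partial)
        partial.pop()

    extend(0, frozenset(), [])
    return leaves
```

Passing a different `score` callable, for example one that performs an optimum superposition, changes only the pruning test and leaves the tree walk unchanged.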

[0313] (5) Decision of similarity between the structure A and the structure B.

[0314] Two point sets are created using elements belonging to the subsets related in (4) above, and an r.m.s.d. value between them is calculated in compliance with Kabsch's method, and when the value is smaller than the threshold value, it is decided that the two structures are similar to each other.

[0315] Described below is a system for retrieving three-dimensional structures of proteins using the secondary structural similarity that can be realized based on the above-mentioned method.

[0316] FIG. 42 illustrates the constitution of a retrieval system that is made up of a data base 160 in which three-dimensional structure data of proteins are registered, a secondary structure calculation unit 161 that determines a secondary structure from the three-dimensional structure data in the data base 160 and divides it into partial structures, a secondary structure coordinate table 162 that stores the results obtained by the secondary structure calculation unit 161 as a type of the secondary structure and three-dimensional coordinates of points that constitute the type of the secondary structure, an input unit 163 that reads an input command of a user, a retrieving unit 164 that retrieves a similar structure based on the aforementioned method relying on the command that is input and the data in the secondary structure coordinate table, and a display unit 165 that graphically displays the retrieved result. Details of the units will now be described.
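For reference, the decision in (5) can be written compactly as follows. The notation is introduced here for illustration only: a_i and b_i are the n corresponding points of the two point sets, R is a rotation and t a translation chosen to minimize the value (the quantity that Kabsch's method computes), and ε is the threshold value.

```latex
\mathrm{r.m.s.d.}(A,B) \;=\; \min_{R,\,t}\ \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl\lVert R\,a_i + t - b_i \bigr\rVert^{2}},
\qquad
A \text{ and } B \text{ are decided to be similar if } \mathrm{r.m.s.d.}(A,B) < \varepsilon .
```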

[0317] (a) Data base 160.

[0318] The data base stores three-dimensional structure data of proteins. The names and three-dimensional coordinate data of the constituent atoms are registered for each of the proteins.

[0319] (b) Secondary structure calculation unit 161.

[0320] The secondary structure calculation unit 161 divides the structure of a protein into types of secondary structures based on the three-dimensional coordinates in the data base, and divides the point set into subsets. Table I shows the types of the secondary structures and the definitions thereof. The type to which the i-th amino acid belongs is sequentially determined according to the definitions shown in Table I, and a subset is created from a series of coordinates of consecutive amino acids belonging to the same type. The thus determined type of the secondary structure and the coordinate data of the constituent amino acids are stored in the secondary structure coordinate table 162. By repeating this operation, all n amino acids are grouped into subsets. FIG. 43 shows a flow of the process related to the determination of the secondary structure and the division into subsets.

TABLE I
Types of secondary structures and their definitions

Type               Definition
3₁₀-Helix          Structure in which carbonyl groups of i-th residues and amide groups of (i+3)-th residues are aligned by hydrogen bonds therebetween.
α-Helix            Structure in which carbonyl groups of i-th residues and amide groups of (i+4)-th residues are aligned by hydrogen bonds therebetween.
Parallel β-sheet   Structure in which hydrogen bonds are formed between carbonyl groups of (i−1)-th residues and amide groups of j-th residues and between carbonyl groups of j-th residues and amide groups of (i+1)-th residues, or hydrogen bonds are formed between carbonyl groups of (j−1)-th residues and amide groups of i-th residues and between carbonyl groups of i-th residues and amide groups of (j+1)-th residues.
3-Turn             Structure in which hydrogen bonds are formed between carbonyl groups of i-th residues and amide groups of (i+2)-th residues.
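The division into subsets described in paragraph [0320] can be sketched as a grouping of consecutive residues that share a type. The sketch assumes that the type of each residue has already been determined from the hydrogen-bond definitions of Table I, which is not shown here; the function name and the type labels are illustrative only.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def divide_into_subsets(
    types: Sequence[str], coords: Sequence[Point]
) -> List[Tuple[str, List[Point]]]:
    """Group consecutive residues of the same secondary-structure type.

    `types[i]` is the (already determined) type of the i-th amino acid and
    `coords[i]` its coordinates; each returned entry is one subset.
    """
    subsets: List[Tuple[str, List[Point]]] = []
    start = 0
    for ss_type, run in groupby(types):
        length = len(list(run))                      # size of this run of equal types
        subsets.append((ss_type, list(coords[start:start + length])))
        start += length
    return subsets
```

Each returned pair of a type and a coordinate list corresponds to one row of the secondary structure coordinate table 162.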

[0321] (c) Secondary structure coordinate table 162.

[0322] FIG. 44 illustrates a constitution of the secondary structure coordinate table 162.

[0323] The secondary structure coordinate table 162 stores the types of the secondary structures determined by the secondary structure calculation unit 161 and the coordinate data of the amino acids constituting each secondary structure. In this example, the subsets S1 and S2 belong to the α-helix type, and the subsets S3, - - - belong to the β-sheet type.

[0324] (d) Input unit 163.

[0325] The input unit 163 reads the name of a protein that serves as a retrieval key based on the secondary structure coordinate table 162 and the input command from the user, and sends it to the retrieving unit 164.

[0326] (e) Retrieving unit 164.

[0327] FIG. 45 shows the processing carried out by the retrieving unit 164. The retrieving unit 164 reads the data stored in the secondary structure coordinate table 162 regarding the key protein sent from the input unit 163, determines the correspondence of subsets, calculates the r.m.s.d. between the two structures, and selects each structure having an r.m.s.d. value that is smaller than the threshold value, thereby retrieving the structures having a high degree of similarity. The correspondence is determined based on the aforementioned method of determining correspondence among the subsets. In this case, the attribute of the subsets is the type of secondary structure. The correspondence is fixed only when the types of the two subsets are the same and when the r.m.s.d. value is smaller than the threshold value when the structures are best matched.

[0328] Next, points are matched with each other with regard to the sets constituted by points that belong to the related subsets, and the r.m.s.d. value of the whole structure is calculated. In the example of FIG. 40, SA1 and SB1 are related to each other, and SA2 and SB3 are related to each other. In this case, matching is effected among the points belonging to the sets (SA1, SA2) and the points belonging to the sets (SB1, SB3), and the r.m.s.d. value is calculated. When the r.m.s.d. value is smaller than the threshold value, the structure is determined to have a similarity and is registered in the retrieved result. This operation is carried out for all of the proteins stored in the secondary structure coordinate table 162, and the three-dimensional structures that are similar to each other in secondary structure are retrieved from all of the data.
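The retrieval loop of paragraphs [0327] and [0328] can be outlined as follows. This is a hedged sketch only: the table layout (a dictionary from protein name to a list of (type, points) subsets), the function name `retrieve_similar`, and the two callables `correspond` and `whole_rmsd` are assumptions introduced for illustration; the callables stand in for the subset correspondence method and the whole-structure r.m.s.d. calculation described above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
SubsetEntry = Tuple[str, List[Point]]          # (secondary-structure type, points)
Table = Dict[str, List[SubsetEntry]]           # protein name -> its subsets


def retrieve_similar(
    table: Table,
    key_name: str,
    correspond: Callable[[List[SubsetEntry], List[SubsetEntry]], List[Tuple[int, int]]],
    whole_rmsd: Callable[[Sequence[Point], Sequence[Point]], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (name, r.m.s.d.) for every protein judged similar to the key."""
    key = table[key_name]
    hits: List[Tuple[str, float]] = []
    for name, subsets in table.items():
        if name == key_name:
            continue
        # Index pairs (i, j): key subset i is related to candidate subset j.
        pairs = correspond(key, subsets)
        if not pairs:
            continue                            # nothing related, nothing to compare
        # Pool the points of all related subsets on each side.
        pts_key = [p for i, _ in pairs for p in key[i][1]]
        pts_cand = [p for _, j in pairs for p in subsets[j][1]]
        value = whole_rmsd(pts_key, pts_cand)
        if value < threshold:
            hits.append((name, value))          # register in the retrieved result
    return hits
```

Keeping the correspondence and r.m.s.d. steps as parameters means the same loop can be used with any matching method that returns index pairs of related subsets.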

[0329] (f) Display unit 165.

[0330] Based on the results retrieved by the retrieving unit 164, the display unit 165 displays the names of proteins having similar structures, the secondary structures of the key protein and of the proteins having similar structures, and the amino acids constituting the secondary structures.

[0331] FIG. 46 shows examples of outputs. FIGS. 47A and 47B illustrate three-dimensional structures of a key protein A used in retrieval and a protein B having a similar structure that is retrieved.

[0332] In FIGS. 47A and 47B, a partial structure of α-helix is represented by a helical ribbon, a partial structure of β-strand is represented by an arrow, and partial structures of loop and turn are represented by tubes. As a result, it will be understood that the key protein is divided into four partial structures of α-helix, β-strand, loop and β-strand in the order of the amino acid sequence, and these partial structures correspond to subsets SA1, SA2, SA3 and SA4, respectively.

[0333] Referring to FIG. 46, subsets SA1, SA2 and SA4 in A are similar to subsets SB10, SB1 and SB3 indicated by arrows in B, and the three partial structures are further similar in their relationship of spatial positions. In A, the loop portion SA3 does not have an arrow, indicating that there is no similar partial structure. Similar portions in the protein B of similar structure are hatched in FIG. 47B.

1. A method of analyzing sequences of atomic groups including a first sequence having m atomic groups and a second sequence having n atomic groups where m and n are integers, comprising the steps of: a) preparing an array S[i] having array elements S[0] to S[m]; b) initializing all array elements of the array S[i] to zero and initializing an integer j to 1; c) adding 1 to each array element S[i] that is equal to an array element S[r] and that satisfies i≥r if the array element S[r] is equal to an array element S[r−1] where r is an occurrence position of the j-th atomic group of the second sequence in the first sequence; d) adding 1 to the integer j; e) repeating the steps c) and d) until the integer j exceeds n; and f) obtaining a longest common atomic group number between the first and the second sequences from a value of the array element S[m].
2. A method of claim 1, further comprising the steps of: g) preparing an array data[k] having array elements data[0], data[1] . . . ; h) storing paired data (r, j) in an array element data[k] if the array element S[i] is changed in the step c) where k=S[r]; i) linking the paired data (r, j) stored in the step h) to paired data (r′, j′) if r′<r and j′<j where the paired data (r′, j′) is one stored in an array element data[k−1]; and j) obtaining a longest common subsequence between the first and the second sequences and occurrence positions of the longest common subsequence in the first and the second sequences by tracing the link formed in the step i).
3. A method of claim 1, further comprising the step of k) evaluating homology between the first and the second sequences based on the longest common atomic group number and a value of one of m and n.
4. A method of claim 3, further comprising the step of l) searching for a sequence that is homologous with the first sequence from among a plurality of sequences, by successively assigning one of the plurality of sequences to the second sequence and executing the steps a) to f) and k).
5. A method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of: a) generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to the second point set from among all candidates for the combination of correspondence; and b) calculating a root mean square distance between the elements corresponding in the combination of correspondence generated in the step a).
6. A method of claim 5, wherein the restriction condition includes order relation of the elements in the first and the second point sets that are ordered.
7. A method of claim 5, wherein the restriction condition includes proximity in a geometric relationship among a plurality of elements close to each other.
8. A method of claim 6, wherein the restriction condition includes proximity in a geometric relationship among a plurality of elements close to each other.
9. A method of claim 5, wherein the restriction condition includes a condition such that a candidate for the combination of correspondence satisfies a threshold value condition.
10. A method of claim 6, wherein the restriction condition includes a condition such that a candidate for the combination of correspondence satisfies a threshold value condition.
11. A method of claim 5, wherein the restriction condition includes a condition such that an attribute value of each of the elements belonging to the first point set coincides with an attribute value of the corresponding element belonging to the second point set in a candidate for the combination of correspondence.
12. A method of claim 6, wherein the restriction condition includes a condition such that an attribute value of each of the elements belonging to the first point set coincides with an attribute value of the corresponding element belonging to the second point set in a candidate for the combination of correspondence.
13. A method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of: a) dividing the second point set into a plurality of subsets having a size that is determined by the size of the first point set; b) generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to each of the subsets of the second point set from among all candidates for the combination of correspondence; and c) calculating a root mean square distance between the elements corresponding in the combination of correspondence generated in the step b).
14. A method of claim 13, wherein the second point set is divided into the subsets so that the number of elements belonging to each of the subsets is a function of the number of elements belonging to the first point set.
15. A method of claim 13, wherein the second point set is divided into the subsets so that a spatial size of each of the subsets is nearly equal to a spatial size of the first point set.
16. A method of analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising the steps of: a) dividing the first point set and the second point set into first subsets and second subsets, respectively, according to a secondary structure exhibited by the three-dimensional coordinates of the elements of the first and the second point sets; b) generating a combination of correspondence satisfying a first restriction condition between the first subsets and the second subsets from among candidates for the combination of correspondence; c) determining an optimum correspondence between the elements belonging to each pair of subsets corresponding in the combination of correspondence generated in the step b); and d) calculating a root mean square distance between all of the elements corresponding in the optimum correspondence in the step c).
17. A method of claim 16, wherein the optimum correspondence determining step comprises the substeps of: i) generating a combination of correspondence satisfying a second restriction condition between the elements belonging to the subsets corresponding in the combination of the correspondence generated in the step b); ii) calculating a root mean square distance between the elements corresponding in the combination of the correspondence generated in the substep i); and iii) selecting a combination of the correspondence as the optimum correspondence according to the value of the root mean square distance calculated in the substep ii).
18. An apparatus for analyzing sequences of atomic groups including a first sequence having m atomic groups and a second sequence having n atomic groups where m and n are integers, comprising: means for preparing an array S[i] having array elements S[0] to S[m]; means for initializing all array elements of the array S[i] to zero and initializing an integer j to 1; means for renewing the array S[i] by adding 1 to each array element S[i] that is equal to an array element S[r] and that satisfies i≥r if the array element S[r] is equal to an array element S[r−1] where r is an occurrence position of the j-th atomic group of the second sequence in the first sequence; means for incrementing the integer j by 1; means for repeatedly activating the renewing means and the incrementing means until the integer j exceeds n; and means for obtaining a longest common atomic group number between the first and the second sequences from a value of the array element S[m].
19. An apparatus of claim 18, further comprising: means for preparing an array data[k] having array elements data[0], data[1] . . . ; means for storing paired data (r, j) in an array element data[k] if the array element S[i] is changed by the renewing means where k=S[r]; means for linking the paired data (r, j) stored by the storing means to paired data (r′, j′) if r′<r and j′<j where the paired data (r′, j′) is one stored in an array element data[k−1]; and means for obtaining a longest common subsequence between the first and the second sequences and occurrence positions of the longest common subsequence in the first and the second sequences by tracing the link formed by the linking means.
20. An apparatus of claim 18, further comprising means for evaluating homology between the first and the second sequences based on the longest common atomic group number and a value of one of m and n.
21. An apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising: means for generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to the second point set from among all candidates for the combination of correspondence; and means for calculating a root mean square distance between the elements corresponding in the combination of correspondence generated by the generating means.
22. An apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising: means for dividing the second point set into a plurality of subsets having a size that is determined by the size of the first point set; means for generating a combination of correspondence satisfying a restriction condition between the elements belonging to the first point set and the elements belonging to each of the subsets of the second point set from among all candidates for the combination of correspondence; and means for calculating a root mean square distance between the elements corresponding in the combination of correspondence generated by the generating means.
23. An apparatus for analyzing three-dimensional structures including a first structure expressed by three-dimensional coordinates of elements belonging to a first point set and a second structure expressed by three-dimensional coordinates of elements belonging to a second point set, comprising: means for dividing the first point set and the second point set into first subsets and second subsets, respectively, according to a secondary structure exhibited by the three-dimensional coordinates of the elements of the first and the second point sets; means for generating a combination of correspondence satisfying a first restriction condition between the first subsets and the second subsets from among candidates for the combination of correspondence; means for determining an optimum correspondence between the elements belonging to each pair of subsets corresponding in the combination of correspondence generated by the generating means; and means for calculating a root mean square distance between all of the elements corresponding in the optimum correspondence.